Introduction¶
This project aims to build robust, interpretable machine learning models to distinguish hypoxic from normoxic states in cancer cell lines based on high-dimensional gene expression data. We focus on two cell lines — HCC1806 and MCF7 — profiled using two RNA sequencing methods: Smart-seq, a high-sensitivity, full-length transcript approach, and Drop-seq, a high-throughput, lower-sensitivity alternative.
Our pipeline includes extensive preprocessing and downstream analysis of normalized expression matrices.
Unsupervised methods are used to uncover intrinsic structure, including clustering (Hierarchical, Leiden, k-means) and dimensionality reduction (PCA, t-SNE, UMAP) to identify patterns across oxygen conditions.
Supervised models — logistic regression, SVMs, random forests, and MLPs — are trained to classify samples by oxygen state and identify key features driving hypoxic responses.
Dataset Naming Convention¶
To keep datasets organized, we use the format `<platform>_<cell>_<stage>`.
Components:¶
- platform: ss = Smart-seq, ds = Drop-seq
- cell: mcf7 = MCF7, hcc = HCC1806
- stage: raw = unfiltered, filt = filtered, norm = filtered + normalized
Examples:¶
| Description | Variable Name |
|---|---|
| Smart-seq unfiltered MCF7 | ss_mcf7_raw |
| Smart-seq filtered MCF7 | ss_mcf7_filt |
| Smart-seq filtered + normalized MCF7 | ss_mcf7_norm |
| Smart-seq unfiltered HCC1806 | ss_hcc_raw |
| Smart-seq filtered + normalized HCC1806 | ss_hcc_norm |
| Drop-seq filtered MCF7 | ds_mcf7_filt |
| Drop-seq filtered + normalized MCF7 | ds_mcf7_norm |
| Drop-seq filtered HCC1806 | ds_hcc_filt |
| Drop-seq filtered + normalized HCC1806 | ds_hcc_norm |
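The convention can be expressed as a tiny helper; this `dataset_name` function is purely illustrative and not used elsewhere in the notebook:

```python
# Hypothetical helper that composes a dataset variable name from the
# components described above (platform, cell line, processing stage).
def dataset_name(platform: str, cell: str, stage: str) -> str:
    assert platform in {"ss", "ds"}          # Smart-seq / Drop-seq
    assert cell in {"mcf7", "hcc"}           # MCF7 / HCC1806
    assert stage in {"raw", "filt", "norm"}  # processing stage
    return f"{platform}_{cell}_{stage}"

print(dataset_name("ss", "mcf7", "norm"))  # -> ss_mcf7_norm
```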
Imports¶
# Standard library
import math
from itertools import combinations
from types import ModuleType
from typing import Any, Callable
# Third-party libraries
import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.io as pio
import scanpy as sc
import seaborn as sns
# SciPy
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.stats import kurtosis, mode, skew
# Matplotlib
from matplotlib.patches import Patch
from matplotlib.ticker import FixedLocator, FixedFormatter
# scikit-learn
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_selection import RFECV, SelectFromModel, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import trustworthiness
from sklearn.metrics import (
accuracy_score,
adjusted_rand_score,
classification_report,
confusion_matrix,
normalized_mutual_info_score,
silhouette_samples,
silhouette_score,
)
from sklearn.model_selection import (
GridSearchCV,
RandomizedSearchCV,
cross_val_score,
learning_curve,
train_test_split,
)
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import LinearSVC, SVC
import warnings
warnings.resetwarnings()
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=ConvergenceWarning)
Meta Data¶
# META DATA
ss_mcf7_meta = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
ss_mcf7_meta.head(5)
| Cell Line | Lane | Pos | Condition | Hours | Cell name | PreprocessingTag | ProcessingComments | |
|---|---|---|---|---|---|---|---|---|
| Filename | ||||||||
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A10 | Hypo | 72 | S28 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A11 | Hypo | 72 | S29 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A12 | Hypo | 72 | S30 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A1 | Norm | 72 | S1 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A2 | Norm | 72 | S2 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
Here we see what information each cell's name encodes, which will be useful later, especially the condition (Hypo/Norm), which is what we ultimately want to predict.
Unfiltered SmartSeq MCF7¶
Exploration¶
In this initial exploration step, we load the unfiltered Smart-Seq file for the MCF7 cell line and examine its dimensions and gene identifiers, as well as inspect basic data quality metrics. Specifically, we:
- read in the raw counts table (genes × cells)
- print the overall shape to see how many genes and cells we have
- see the first few rows to verify the per-cell expression values
- use .describe() to summarize distributions across cells
- check for any missing values
This quick scan gives us confidence that the data are loaded correctly and sets the stage for filtering, normalization, and more detailed analysis.
ss_mcf7_raw = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_Unfiltered_Data.txt",delimiter=" ",engine='python',index_col=0)
gene_symbls = ss_mcf7_raw.index
print("Dataframe indexes: ", gene_symbls)
ss_mcf7_raw.shape
Dataframe indexes: Index(['WASH7P', 'MIR6859-1', 'WASH9P', 'OR4F29', 'MTND1P23', 'MTND2P28',
'MTCO1P12', 'MTCO2P12', 'MTATP8P1', 'MTATP6P1',
...
'MT-TH', 'MT-TS2', 'MT-TL2', 'MT-ND5', 'MT-ND6', 'MT-TE', 'MT-CYB',
'MT-TT', 'MT-TP', 'MAFIP'],
dtype='object', length=22934)
(22934, 383)
# How much of each gene (row) is in each cell (column)
ss_mcf7_raw.head(5)
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam | output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam | output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam | output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam | output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam | ... | output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam | output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam | output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam | output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam | output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam | output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WASH7P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| MIR6859-1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WASH9P | 1 | 0 | 0 | 0 | 0 | 1 | 10 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 5 |
| OR4F29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| MTND1P23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 383 columns
ss_mcf7_raw.describe()
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam | output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam | output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam | output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam | output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam | ... | output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam | output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam | output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam | output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam | output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam | output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | ... | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 |
| mean | 40.817651 | 0.012253 | 86.442400 | 1.024636 | 14.531351 | 56.213613 | 75.397183 | 62.767725 | 67.396747 | 2.240734 | ... | 17.362562 | 42.080230 | 34.692422 | 32.735284 | 21.992718 | 17.439391 | 49.242784 | 61.545609 | 68.289352 | 62.851400 |
| std | 465.709940 | 0.207726 | 1036.572689 | 6.097362 | 123.800530 | 503.599145 | 430.471519 | 520.167576 | 459.689019 | 25.449630 | ... | 193.153757 | 256.775704 | 679.960908 | 300.291051 | 153.441647 | 198.179666 | 359.337479 | 540.847355 | 636.892085 | 785.670341 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 17.000000 | 0.000000 | 5.000000 | 0.000000 | 7.000000 | 23.000000 | 39.000000 | 35.000000 | 38.000000 | 1.000000 | ... | 9.000000 | 30.000000 | 0.000000 | 17.000000 | 12.000000 | 9.000000 | 27.000000 | 30.000000 | 38.000000 | 33.000000 |
| max | 46744.000000 | 14.000000 | 82047.000000 | 289.000000 | 10582.000000 | 46856.000000 | 29534.000000 | 50972.000000 | 36236.000000 | 1707.000000 | ... | 17800.000000 | 23355.000000 | 81952.000000 | 29540.000000 | 12149.000000 | 19285.000000 | 28021.000000 | 40708.000000 | 46261.000000 | 68790.000000 |
8 rows × 383 columns
# MISSING VALUES
ss_mcf7_raw.isnull().values.any()
np.False_
Gene Counts¶
In this section we:
- add up all the reads in each cell to measure its total library size
- make a bar chart (colored by hypoxia vs. normoxia) to spot any differences
- group samples by their ID letters and count how many hypoxic and normoxic cells are in each group
This helps us check whether one condition has consistently higher or lower total counts before we move on.
ss_mcf7_raw_small = ss_mcf7_raw.iloc[:, 150:220]  # plot a 70-cell subset so the bar labels stay legible
column_sums = ss_mcf7_raw_small.sum(axis=0)
column_sums_sorted = column_sums.sort_values(ascending=False)
sorted_labels = column_sums_sorted.index
clean_labels = sorted_labels.str.replace(r"output\.STAR\.", "", regex=True)
clean_labels = clean_labels.str.replace(r"_Aligned\.sortedByCoord\.out\.bam", "", regex=True)
colors = [
'royalblue' if 'Hypo' in label else
'seagreen' if 'Norm' in label else
'gray'
for label in clean_labels
]
plt.figure(figsize=(14,8))
plt.bar(clean_labels, column_sums_sorted.values, color=colors)
plt.xticks(rotation=90, fontsize=8)
plt.title('Total Read Counts per Cell')
plt.xlabel('Cell')
plt.ylabel('Total Read Count')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()
column_sums = ss_mcf7_raw.sum(axis=0)
column_sums_sorted = column_sums.sort_values(ascending=False)
sorted_labels = column_sums_sorted.index
clean_labels = sorted_labels.str.replace(r"output\.STAR\.", "", regex=True)
clean_labels = clean_labels.str.replace(r"_Aligned\.sortedByCoord\.out\.bam", "", regex=True)
# Extract letter from label (after first underscore)
first_letters = clean_labels.str.extract(r'_(\w)')[0]
print(first_letters)
from collections import defaultdict
# Initialize counters
group_counts = defaultdict(lambda: {'Hypo': 0, 'Norm': 0})
for idx, name in enumerate(clean_labels):
    # Extract the letter after the first underscore
    letter = first_letters[idx]
    # Check condition
    if 'hypo' in name.lower():
        group_counts[letter]['Hypo'] += 1
    elif 'norm' in name.lower():
        group_counts[letter]['Norm'] += 1
print(group_counts)
0 C
1 B
2 C
3 A
4 A
..
378 E
379 H
380 D
381 G
382 H
Name: 0, Length: 383, dtype: object
defaultdict(<function <lambda> at 0x2bbcb47c0>, {'C': {'Hypo': 24, 'Norm': 24}, 'B': {'Hypo': 24, 'Norm': 24}, 'A': {'Hypo': 24, 'Norm': 24}, 'E': {'Hypo': 24, 'Norm': 24}, 'D': {'Hypo': 24, 'Norm': 24}, 'F': {'Hypo': 24, 'Norm': 24}, 'G': {'Hypo': 24, 'Norm': 24}, 'H': {'Hypo': 23, 'Norm': 24}})
We conclude that classes are balanced and this is also true across the groups (ID letters).
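The same per-letter tally can be written more compactly with pandas instead of a manual `defaultdict` loop; a sketch on toy labels, assuming the `_<letter><number>_<Condition>_` pattern seen in the cleaned sample names:

```python
import pandas as pd

# Toy stand-ins for the cleaned sample labels used above.
names = pd.Series([
    "1_A10_Hypo_S28", "1_A1_Norm_S1", "1_A2_Norm_S2",
    "1_B7_Hypo_S25", "1_B8_Hypo_S26", "1_B1_Norm_S3",
])
letters = names.str.extract(r"_([A-H])")[0]        # plate row letter
conditions = names.str.extract(r"(Hypo|Norm)")[0]  # oxygen condition
counts = pd.crosstab(letters, conditions)          # letters x conditions table
print(counts)
```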
Outliers¶
Q1 = ss_mcf7_raw.quantile(0.25)
Q3 = ss_mcf7_raw.quantile(0.75)
IQR = Q3 - Q1
# Keep only the rows that have no outliers
ss_mcf7_raw_noOut = ss_mcf7_raw[~((ss_mcf7_raw < (Q1 - 1.5 * IQR)) | (ss_mcf7_raw > (Q3 + 1.5 * IQR))).any(axis=1)]
ss_mcf7_raw_noOut.shape
(6435, 383)
The IQR method removes 22,934 - 6,435 = 16,499 rows, roughly 72% of our genes. This approach is not valid here: the data are too sparse, so for most genes Q1 = Q3 = 0 and nearly any nonzero count gets flagged as an outlier.
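To see why IQR filtering collapses on sparse count data, consider a toy gene that is zero in all but one cell:

```python
import pandas as pd

# When a gene is zero in at least 75% of cells, Q1 = Q3 = 0, so IQR = 0
# and any nonzero count falls outside the [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# band. Toy gene with a single genuinely expressing cell:
gene = pd.Series([0, 0, 0, 0, 0, 0, 0, 12])
q1, q3 = gene.quantile(0.25), gene.quantile(0.75)
iqr = q3 - q1
outliers = (gene < q1 - 1.5 * iqr) | (gene > q3 + 1.5 * iqr)
print(iqr, int(outliers.sum()))  # IQR is 0; the expressing cell is flagged
```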
Quality Control & Violin Plots¶
In this section, we:
- Compute per-cell QC metrics:
- total counts
- number of genes detected
- percent mitochondrial reads (MT-genes as a fraction of total)
- percent zeros (dropouts)
- visualize distributions with histograms and violin plots to spot outliers or skewed distributions
- filter out low-quality cells using intuitive thresholds (e.g., <2,000 genes, <100,000 reads, >10% mitochondrial), then re-plot the post-filter distributions to confirm that most remaining cells lie within acceptable ranges
# Create QC DataFrame
qc_ss_mcf7 = pd.DataFrame(index=ss_mcf7_raw.columns)
# Total counts
qc_ss_mcf7['total_counts'] = ss_mcf7_raw.sum(axis=0)
print("\nComputed total_counts per cell.")
print(qc_ss_mcf7['total_counts'].describe())
# Number of genes detected per cell
qc_ss_mcf7['n_genes'] = (ss_mcf7_raw > 0).sum(axis=0)
print("\nComputed n_genes per cell.")
print(qc_ss_mcf7['n_genes'].describe())
# Mitochondrial genes
mito_genes = [gene for gene in ss_mcf7_raw.index if gene.startswith("MT-") or gene.startswith("MT.")]
print(f"\nIdentified {len(mito_genes)} mitochondrial genes.")
# % Mitochondrial expression
qc_ss_mcf7['pct_mito'] = ss_mcf7_raw.loc[mito_genes].sum(axis=0) / qc_ss_mcf7['total_counts'] * 100
print("\nComputed percent mitochondrial gene expression per cell.")
print(qc_ss_mcf7['pct_mito'].describe())
# Percentage of Zeros per Sample
qc_ss_mcf7['percent_zeros'] = (ss_mcf7_raw == 0).sum(axis=0) / ss_mcf7_raw.shape[0] * 100
Computed total_counts per cell.
count    3.830000e+02
mean     9.946119e+05
std      5.503732e+05
min      1.000000e+00
25%      5.987505e+05
50%      1.129334e+06
75%      1.408638e+06
max      2.308057e+06
Name: total_counts, dtype: float64

Computed n_genes per cell.
count      383.000000
mean      9124.219321
std       2693.309249
min          1.000000
25%       8456.500000
50%       9907.000000
75%      10789.000000
max      12519.000000
Name: n_genes, dtype: float64

Identified 36 mitochondrial genes.

Computed percent mitochondrial gene expression per cell.
count    383.000000
mean       1.911659
std        2.355400
min        0.000000
25%        0.740893
50%        1.528072
75%        2.597771
max       31.033833
Name: pct_mito, dtype: float64
fig, axs = plt.subplots(1, 3, figsize=(15, 4))
axs[0].hist(qc_ss_mcf7['total_counts'], bins=30, color='gray')
axs[0].set_title("Total Counts per Sample")
axs[0].set_xlabel("Total Counts")
axs[1].hist(qc_ss_mcf7['n_genes'], bins=30, color='steelblue')
axs[1].set_title("Number of Genes per Sample")
axs[1].set_xlabel("Genes Detected")
axs[2].hist(qc_ss_mcf7['percent_zeros'], bins=30, color='darkred')
axs[2].set_title("% Zeros per Sample")
axs[2].set_xlabel("Percent Zeros")
plt.tight_layout()
plt.show()
We see a long tail of low-count cells — those below ~100,000 reads will be removed.
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
sns.violinplot(y=qc_ss_mcf7['total_counts'], ax=axes[0])
axes[0].set_title("Total Counts per Cell")
sns.violinplot(y=qc_ss_mcf7['n_genes'], ax=axes[1])
axes[1].set_title("Number of Genes per Cell")
sns.violinplot(y=qc_ss_mcf7['pct_mito'], ax=axes[2])
axes[2].set_title("Mitochondrial Gene %")
plt.tight_layout()
plt.show()
Most cells cluster around 5,000–10,000 detected genes, but a few drop below 2,000.
# QC Scatter Plot
adata = ad.AnnData(X=ss_mcf7_raw.T)
adata.obs['total_counts'] = qc_ss_mcf7['total_counts']
adata.obs['n_genes_by_counts'] = qc_ss_mcf7['n_genes']
adata.obs['pct_counts_mt'] = qc_ss_mcf7['pct_mito']
sc.pl.scatter(
adata,
x="total_counts",
y="n_genes_by_counts",
color="pct_counts_mt"
)
This scatter plot visualizes key quality metrics for each cell:
- X-axis: Total number of transcripts detected per cell (
total_counts) - Y-axis: Number of unique genes detected per cell (
n_genes_by_counts) - Color: Proportion of reads mapping to mitochondrial genes (
pct_counts_mt), a known marker of cell stress or apoptosis.
Most cells show a healthy profile with:
- High gene detection
- Moderate transcript counts
- Low mitochondrial content (dark colors)
However, a few outliers have:
- Low gene counts
- High mitochondrial percentages (bright yellow points)
These may represent low-quality or dying cells and are typically filtered out in preprocessing to improve downstream analyses.
High mitochondrial gene expression in a cell usually indicates poor quality, often because the cell was:
- Stressed
- Dying or partially lysed
- Degraded
Now we filter the data:
min_genes = 2_000 # Cells with very low gene counts (< 2000) should be filtered out
min_counts = 100_000 # Cells with extremely low counts may be low-quality
max_mito = 10 # A common threshold is 5%-10% to flag high-mito cells
high_quality_cells = qc_ss_mcf7[
(qc_ss_mcf7['n_genes'] > min_genes) &
(qc_ss_mcf7['total_counts'] > min_counts) &
(qc_ss_mcf7['pct_mito'] < max_mito)
]
# Retain only the high-quality columns (cells)
ss_mcf7_raw_filt = ss_mcf7_raw[high_quality_cells.index]
print(f"Original: {ss_mcf7_raw.shape[1]} cells")
print(f"Filtered: {ss_mcf7_raw_filt.shape[1]} cells")
Original: 383 cells
Filtered: 337 cells
Filtering removed 46 cells. Let's see how the violin and QC scatter plots look now.
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
sns.violinplot(y=high_quality_cells['total_counts'], ax=axes[0])
axes[0].set_title("Total Counts per Cell")
sns.violinplot(y=high_quality_cells['n_genes'], ax=axes[1])
axes[1].set_title("Number of Genes per Cell")
sns.violinplot(y=high_quality_cells['pct_mito'], ax=axes[2])
axes[2].set_title("Mitochondrial Gene %")
plt.tight_layout()
plt.show()
After filtering, the distributions tighten, indicating that outlier cells have been successfully removed.
adata = ad.AnnData(X=ss_mcf7_raw_filt.T)
adata.obs['total_counts'] = high_quality_cells['total_counts']
adata.obs['n_genes_by_counts'] = high_quality_cells['n_genes']
adata.obs['pct_counts_mt'] = high_quality_cells['pct_mito']
sc.pl.violin(
adata,
["total_counts", "n_genes_by_counts", "pct_counts_mt"],
jitter=0.4,
multi_panel=True
)
sc.pl.scatter(
adata,
x="total_counts",
y="n_genes_by_counts",
color="pct_counts_mt"
)
The scatter plot confirms that we have removed cells with extremely low gene counts, low total counts, and high mitochondrial content (the color scale on the right now spans a much lower range).
Duplicates¶
First, we remove genes (rows) that have zero expression across all cells. These genes contain no information and contribute neither to biological signal nor technical variation. Keeping them would only increase dimensionality and computational load without adding value.
# Check original number of genes (rows)
original_rows = ss_mcf7_raw_filt.shape[0]
# Drop genes with all-zero expression
ss_mcf7_raw_filt = ss_mcf7_raw_filt.loc[~(ss_mcf7_raw_filt == 0).all(axis=1)]
# Check number of rows after dropping
remaining_rows = ss_mcf7_raw_filt.shape[0]
# Compute how many were dropped
dropped_rows = original_rows - remaining_rows
print(f"Number of all-zero rows dropped: {dropped_rows}")
Number of all-zero rows dropped: 36
Next, we remove duplicate genes — that is, genes that have identical expression profiles across all cells.
This can happen due to:
- redundant gene IDs
- dummy genes or technical artifacts
- perfectly zeroed-out rows (common in sparse data)
We drop all but the first occurrence of each set of duplicate rows:
duplicate_rows = ss_mcf7_raw_filt[ss_mcf7_raw_filt.duplicated(keep=False)]
print("number of duplicate rows: ", duplicate_rows.shape[0])
print("Rows before:", ss_mcf7_raw_filt.shape[0])
ss_mcf7_raw_filt = ss_mcf7_raw_filt.drop_duplicates()
print("Rows after :", ss_mcf7_raw_filt.shape[0])
number of duplicate rows:  98
Rows before: 22898
Rows after : 22843
We do a quick check to make sure that there are no cells (columns) with zero expression across all genes:
zero_cols = (ss_mcf7_raw_filt == 0).all(axis=0)
print(f"Number of all-zero columns: {zero_cols.sum()}")
Number of all-zero columns: 0
Skewness & Kurtosis¶
In this “Skewness and Kurtosis” step, we check how lopsided and heavy-tailed our per-cell expression profiles are, both before and after a simple log2 transformation:
- skewness tells us if a cell’s expression values lean more to one side (positive skew means a long right tail; negative skew means a long left tail)
- kurtosis measures how heavy those tails are (high kurtosis means more extreme outliers)
from scipy.stats import kurtosis, skew
colN = np.shape(ss_mcf7_raw_filt)[1]
colN
df_skew_cells = []
cnames = ss_mcf7_raw_filt.columns
for i in range(colN):
    v_df = ss_mcf7_raw_filt[cnames[i]]
    df_skew_cells.append(skew(v_df))
df_skew_cells
sns.histplot(df_skew_cells,bins=100)
plt.xlabel('Skewness of single cells expression profiles - ss_mcf7_raw_filt')
Text(0.5, 0, 'Skewness of single cells expression profiles - ss_mcf7_raw_filt')
Here we see that most cells have skewness values between 40 and 70, with a peak around 50–60. This indicates that the expression distributions are strongly right-skewed, which is expected in single-cell RNA-seq data due to many low-expression genes and a few highly expressed ones. The consistent skewness across cells reflects the sparse nature of the data, though extremely high or low skewness values may indicate outliers or technical artifacts.
df_kurt_cells = []
for i in range(colN):
    v_df = ss_mcf7_raw_filt[cnames[i]]
    df_kurt_cells.append(kurtosis(v_df))
df_kurt_cells
sns.histplot(df_kurt_cells,bins=100)
plt.xlabel('Kurtosis of single cells expression profiles - ss_mcf7_raw_filt')
Text(0.5, 0, 'Kurtosis of single cells expression profiles - ss_mcf7_raw_filt')
- The kurtosis distribution is right-skewed, with a long tail toward higher kurtosis values.
- Most cells fall within the 2,000–6,000 kurtosis range.
- A few cells show extremely high kurtosis (>10,000), which are potential outliers.
- The distribution is highly non-normal.
To reduce skew and heavy tails, we apply a log2(x+1) transform.
# DATA TRANSFORMATION
ss_mcf7_raw_filt_log = np.log2(ss_mcf7_raw_filt + 1) # genes × cells
ss_mcf7_raw_filt_T = ss_mcf7_raw_filt.T
ss_mcf7_raw_filt_T_log = np.log2(ss_mcf7_raw_filt_T + 1) # cells × genes (transpose necessary for skew() and kurtosis())
# Skewness and kurtosis should be reduced now
print("Before data transformation:", skew(ss_mcf7_raw_filt.T.values.flatten()), kurtosis(ss_mcf7_raw_filt.T.values.flatten()))
print("After data transformation:", skew(ss_mcf7_raw_filt_T_log.values.flatten()), kurtosis(ss_mcf7_raw_filt_T_log.values.flatten()))
Before data transformation: 85.61749272454507 11944.74886564649
After data transformation: 0.9824438044182301 -0.3223454127671732
colN = ss_mcf7_raw_filt_log.shape[1]
cnames = ss_mcf7_raw_filt_log.columns
# Compute skewness for each cell
df_skew_cells_log = []
for i in range(colN):
    v_df = ss_mcf7_raw_filt_log[cnames[i]]
    df_skew_cells_log.append(skew(v_df))
# Plot histogram
sns.histplot(df_skew_cells_log, bins=100)
plt.xlabel('Skewness of single cells expression profiles (log2 transformed)')
plt.title('Distribution of Skewness (log2-transformed MCF7)')
plt.tight_layout()
plt.show()
Post-transform, the skewness distribution tightens around zero, indicating a more symmetric profile.
# Compute kurtosis for each cell
df_kurt_cells_log = []
for i in range(colN):
    v_df = ss_mcf7_raw_filt_log[cnames[i]]
    df_kurt_cells_log.append(kurtosis(v_df))
# Plot histogram of kurtosis
sns.histplot(df_kurt_cells_log, bins=100)
plt.xlabel('Kurtosis of single cells expression profiles (log2-transformed)')
plt.title('Distribution of Kurtosis - log2(ss_mcf7_raw_filt + 1)')
plt.tight_layout()
plt.show()
And the kurtosis drops towards a normal range, meaning fewer extreme outliers remain.
Train a linear classifier in PCA space¶
In this section, we use Principal Component Analysis (PCA) to compress our filtered, log-transformed MCF7 expression profiles into their top axes of variation, then train a simple logistic-regression model directly on those axes to distinguish hypoxic from normoxic samples. By projecting into 2D and 3D PCA space, we can visually assess how well the two conditions separate, and by fitting a linear classifier we quantify how much of that separation is captured by a single decision plane.
# 1. Transpose the DataFrame so that rows = samples, columns = genes
ss_mcf7_raw_filt_T = ss_mcf7_raw_filt.T # transpose necessary for pca.fit_transform()
# 2. Log-transform if data isn't already normalized
ss_mcf7_raw_filt_T_log = np.log2(ss_mcf7_raw_filt_T + 1)
# 3. Generate colors based on column/sample names
colors = ['royalblue' if 'hypo' in name.lower() else 'seagreen' for name in ss_mcf7_raw_filt_T_log.index]
# 4. Run PCA with 2 components
pca = PCA(n_components=2)
pc = pca.fit_transform(ss_mcf7_raw_filt_T_log)
# 5. Plot
plt.figure(figsize=(8,6))
plt.scatter(pc[:,0], pc[:,1], c=colors)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.title("PCA of Samples (Colored by Condition)")
plt.grid(True)
legend_elements = [
Patch(facecolor='seagreen', label='Normoxia'),
Patch(facecolor='royalblue', label='Hypoxia')
]
plt.legend(handles=legend_elements, title='Condition')
plt.tight_layout()
plt.show()
The 2D projection shows that hypoxic (blue) and normoxic (green) samples form distinct clusters along PC1 and PC2.
# Run PCA with 3 components
pca = PCA(n_components=3)
pc = pca.fit_transform(ss_mcf7_raw_filt_T_log)
pio.renderers.default = 'browser'
# Extract labels from sample names
labels = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in ss_mcf7_raw_filt_T_log.index]
# Get indices for each condition
hypo_idx = np.array(labels) == 'Hypo'
norm_idx = np.array(labels) == 'Norm'
fig = go.Figure()
# Hypo samples
fig.add_trace(go.Scatter3d(
x=pc[hypo_idx, 0],
y=pc[hypo_idx, 1],
z=pc[hypo_idx, 2],
mode='markers',
name='Hypo',
marker=dict(color='royalblue', size=6),
text=ss_mcf7_raw_filt_T_log.index[hypo_idx],
hoverinfo='text'
))
# Norm samples
fig.add_trace(go.Scatter3d(
x=pc[norm_idx, 0],
y=pc[norm_idx, 1],
z=pc[norm_idx, 2],
mode='markers',
name='Norm',
marker=dict(color='seagreen', size=6),
text=ss_mcf7_raw_filt_T_log.index[norm_idx],
hoverinfo='text'
))
# Set layout
fig.update_layout(
scene=dict(
xaxis_title=f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)',
yaxis_title=f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)',
zaxis_title=f'PC3 ({pca.explained_variance_ratio_[2]*100:.1f}%)'
),
title='3D PCA of Samples (Interactive)',
margin=dict(l=0, r=0, b=0, t=40)
)
fig.show()
This is an interactive plot that opens in the browser. In 3D PCA space, the two conditions appear even more separable, suggesting that a linear boundary may achieve high classification accuracy. Next, we fit a logistic regression on the three PCA coordinates to find the optimal separating plane between hypoxic and normoxic samples.
labels = [1 if 'hypo' in name.lower() else 0 for name in ss_mcf7_raw_filt_T_log.index]
clf = LogisticRegression()
clf.fit(pc, labels)
# Extract coefficients (normal vector to the plane)
w = clf.coef_[0] # [w1, w2, w3]
b = clf.intercept_[0]
# Create grid to cover the PCA space
x_range = np.linspace(pc[:, 0].min(), pc[:, 0].max(), 10)
y_range = np.linspace(pc[:, 1].min(), pc[:, 1].max(), 10)
xx, yy = np.meshgrid(x_range, y_range)
# Compute corresponding z for the plane
zz = (-w[0] * xx - w[1] * yy - b) / w[2]
# Compute decision values for all points
decision_values = np.dot(pc, w) + b
# Predicted labels: 1 if value > 0 (Hypo), else 0 (Norm)
predicted_labels = (decision_values > 0).astype(int)
The coefficient vector `w` and intercept `b` define our decision plane in PCA space.
labels = np.array([1 if 'hypo' in name.lower() else 0 for name in ss_mcf7_raw_filt_T_log.index])
accuracy = (predicted_labels == labels).mean()
print(f"Accuracy of plane: {accuracy * 100:.2f}%")
Accuracy of plane: 99.41%
Result: our linear classifier achieves ~99.41% accuracy on the samples it was fit on, confirming that the PCA projection retains the key signal distinguishing hypoxia from normoxia. This suggests that the Smart-seq MCF7 data are highly separable and well suited to supervised learning.
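One caveat: the accuracy above is measured on the same samples the model was fit on, so it is a training accuracy. A more conservative estimate comes from cross-validation, with PCA fit inside each fold to avoid leakage; a minimal sketch, using synthetic data standing in for the expression matrix and labels:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the (cells x genes) log-expression matrix.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)
pipe = Pipeline([
    ("pca", PCA(n_components=3)),                 # same projection as above
    ("clf", LogisticRegression(max_iter=1000)),   # linear decision plane
])
# PCA is refit on each training fold, so the test fold stays unseen.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Wrapping PCA and the classifier in a `Pipeline` keeps the projection from ever seeing held-out samples during fitting.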
Unfiltered SmartSeq HCC1806¶
Exploration¶
In this first step for the HCC1806 line, as with the previous cell line, we:
- load the unfiltered Smart-Seq expression matrix and grab the gene symbols as row labels
- check the overall shape to see how many genes and samples we have
- view the first few rows to confirm the per-cell read counts look sensible
- summarize basic statistics with .describe() to inspect ranges, means, and quartiles
- verify there are no missing values that could interfere with our analysis
This quick check ensures that the HCC1806 data are correctly loaded and free of major issues before we move on to filtering, normalization, and deeper quality control.
ss_hcc_raw = pd.read_csv("AILab2025/SmartSeq/HCC1806_SmartS_Unfiltered_Data.txt",delimiter=" ",engine='python',index_col=0)
gene_symbls = ss_hcc_raw.index
print("Dataframe indexes: ", gene_symbls)
ss_hcc_raw.shape
Dataframe indexes: Index(['WASH7P', 'CICP27', 'DDX11L17', 'WASH9P', 'OR4F29', 'MTND1P23',
'MTND2P28', 'MTCO1P12', 'MTCO2P12', 'MTATP8P1',
...
'MT-TH', 'MT-TS2', 'MT-TL2', 'MT-ND5', 'MT-ND6', 'MT-TE', 'MT-CYB',
'MT-TT', 'MT-TP', 'MAFIP'],
dtype='object', length=23396)
(23396, 243)
ss_hcc_raw.head(5)
| output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam | ... | output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WASH7P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CICP27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DDX11L17 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WASH9P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| OR4F29 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 243 columns
ss_hcc_raw.describe()
| output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam | ... | output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | ... | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 |
| mean | 99.565695 | 207.678278 | 9.694734 | 150.689007 | 35.700504 | 47.088434 | 152.799453 | 135.869422 | 38.363908 | 45.512139 | ... | 76.361771 | 105.566593 | 54.026116 | 29.763806 | 28.905411 | 104.740725 | 35.181569 | 108.197940 | 37.279962 | 76.303855 |
| std | 529.532443 | 981.107905 | 65.546050 | 976.936548 | 205.885369 | 545.367706 | 864.974182 | 870.729740 | 265.062493 | 366.704721 | ... | 346.659348 | 536.881574 | 344.068304 | 186.721266 | 135.474736 | 444.773045 | 170.872090 | 589.082268 | 181.398951 | 369.090274 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 51.000000 | 125.000000 | 5.000000 | 40.000000 | 22.000000 | 17.000000 | 81.000000 | 76.000000 | 22.000000 | 18.000000 | ... | 56.000000 | 67.000000 | 29.000000 | 18.000000 | 19.000000 | 76.000000 | 24.000000 | 68.000000 | 22.000000 | 44.000000 |
| max | 35477.000000 | 69068.000000 | 6351.000000 | 70206.000000 | 17326.000000 | 47442.000000 | 43081.000000 | 62813.000000 | 30240.000000 | 35450.000000 | ... | 19629.000000 | 30987.000000 | 21894.000000 | 13457.000000 | 11488.000000 | 33462.000000 | 15403.000000 | 34478.000000 | 10921.000000 | 28532.000000 |
8 rows × 243 columns
# MISSING VALUES
ss_hcc_raw.isnull().values.any()
np.False_
Gene Counts¶
In the “Gene Counts” step for HCC1806, we:
- sum the read counts in each cell
- clean up the sample names and color-code them by condition (hypoxia vs. normoxia)
- plot a bar chart of total read counts per cell, so we can quickly spot if one condition systematically yields more or fewer reads
ss_hcc_raw_small = ss_hcc_raw.iloc[:, 150:220]  # plot a subset of cells so the bar chart stays readable
column_sums = ss_hcc_raw_small.sum(axis=0)
column_sums_sorted = column_sums.sort_values(ascending=False)
sorted_labels = column_sums_sorted.index
clean_labels = sorted_labels.str.replace(r"output\.STAR\.", "", regex=True)
clean_labels = clean_labels.str.replace(r"_Aligned\.sortedByCoord\.out\.bam", "", regex=True)
colors = [
'royalblue' if 'Hypo' in label else
'seagreen' if 'Norm' in label else
'gray'
for label in clean_labels
]
plt.figure(figsize=(14,8))
plt.bar(clean_labels, column_sums_sorted.values, color=colors)
plt.xticks(rotation=90, fontsize=8)
plt.title('Total Read Counts per Cell')
plt.xlabel('Cell')
plt.ylabel('Total Read Count')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()
A handful of samples show zero counts across all genes — these empty profiles should be filtered out as they won’t contribute any useful information.
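Dropping those empty profiles can be sketched on a toy matrix with the same genes × cells orientation (gene and cell names here are illustrative):

```python
import pandas as pd

# Toy genes x cells matrix; 'cell_2' has zero counts across all genes
counts = pd.DataFrame(
    {"cell_0": [3, 0, 7], "cell_1": [1, 2, 0], "cell_2": [0, 0, 0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# Keep only cells (columns) whose total count is non-zero
nonempty = counts.loc[:, counts.sum(axis=0) > 0]
print(nonempty.columns.tolist())  # 'cell_2' is dropped
```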
Outliers¶
Q1 = ss_hcc_raw.quantile(0.25)
Q3 = ss_hcc_raw.quantile(0.75)
IQR = Q3 - Q1
# Keep only the rows that have no outliers
ss_hcc_raw_noOut = ss_hcc_raw[~((ss_hcc_raw < (Q1 - 1.5 * IQR)) | (ss_hcc_raw > (Q3 + 1.5 * IQR))).any(axis=1)]
ss_hcc_raw_noOut.shape
(10815, 243)
The IQR method removes 23396 - 10815 = 12581 rows, roughly 54% of our genes. Because the matrix is extremely sparse, Q1 (and often Q3) is zero for most cells, so almost any non-zero count gets flagged as an outlier. This approach is therefore not valid for data this sparse.
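A sparsity-aware alternative (and the rule we test later in the MCF7 raw-vs-filtered comparison) is prevalence filtering: keep a gene only if it is detected in a minimum number of cells. A minimal sketch on a toy matrix, with a hypothetical `min_cells` threshold:

```python
import pandas as pd

# Toy genes x cells matrix; GENE_C is detected in only one cell
counts = pd.DataFrame(
    {"cell_0": [5, 0, 0], "cell_1": [2, 1, 0], "cell_2": [0, 3, 1]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

min_cells = 2  # hypothetical threshold
detected_in = (counts > 0).sum(axis=1)  # number of cells where each gene is non-zero
kept = counts.loc[detected_in >= min_cells]
print(kept.index.tolist())  # GENE_C is dropped
```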
Quality Control & Violin Plots¶
In this section, we calculate and visualize key QC metrics for the HCC1806 single-cell data to identify and remove low-quality cells before further analysis. We:
- compute per-cell metrics: total read counts, number of genes detected, percent mitochondrial reads, and percent zeros
- plot distributions with histograms to flag outliers, then violin plots to compare density and spread across cells
- filter cells using intuitive thresholds (e.g. >2,000 genes, >100,000 reads, <10% mito), and re-plot the post-filter metrics
# Create QC DataFrame
qc_ss_hcc = pd.DataFrame(index=ss_hcc_raw.columns)
# Total counts
qc_ss_hcc['total_counts'] = ss_hcc_raw.sum(axis=0)
print("\nComputed total_counts per cell.")
print(qc_ss_hcc['total_counts'].describe())
# Number of genes detected per cell
qc_ss_hcc['n_genes'] = (ss_hcc_raw > 0).sum(axis=0)
print("\nComputed n_genes per cell.")
print(qc_ss_hcc['n_genes'].describe())
# Mitochondrial genes
mito_genes = [gene for gene in ss_hcc_raw.index if gene.startswith("MT-") or gene.startswith("MT.")]
print(f"\nIdentified {len(mito_genes)} mitochondrial genes.")
# % Mitochondrial expression
qc_ss_hcc['pct_mito'] = ss_hcc_raw.loc[mito_genes].sum(axis=0) / qc_ss_hcc['total_counts'] * 100
print("\nComputed percent mitochondrial gene expression per cell.")
print(qc_ss_hcc['pct_mito'].describe())
# Percentage of Zeros per Sample
qc_ss_hcc['percent_zeros'] = (ss_hcc_raw == 0).sum(axis=0) / ss_hcc_raw.shape[0] * 100
Computed total_counts per cell.
count    2.430000e+02
mean     2.012306e+06
std      1.171858e+06
min      1.140000e+02
25%      9.910625e+05
50%      2.067645e+06
75%      2.925182e+06
max      5.758132e+06
Name: total_counts, dtype: float64

Computed n_genes per cell.
count      243.000000
mean     10330.358025
std       2260.259356
min         35.000000
25%      10117.000000
50%      10831.000000
75%      11409.000000
max      13986.000000
Name: n_genes, dtype: float64

Identified 36 mitochondrial genes.

Computed percent mitochondrial gene expression per cell.
count    243.000000
mean       2.197282
std        3.173782
min        0.000000
25%        1.462458
50%        1.840945
75%        2.468950
max       49.215686
Name: pct_mito, dtype: float64
fig, axs = plt.subplots(1, 3, figsize=(15, 4))
axs[0].hist(qc_ss_hcc['total_counts'], bins=30, color='gray')
axs[0].set_title("Total Counts per Sample")
axs[0].set_xlabel("Total Counts")
axs[1].hist(qc_ss_hcc['n_genes'], bins=30, color='steelblue')
axs[1].set_title("Number of Genes per Sample")
axs[1].set_xlabel("Genes Detected")
axs[2].hist(qc_ss_hcc['percent_zeros'], bins=30, color='darkred')
axs[2].set_title("% Zeros per Sample")
axs[2].set_xlabel("Percent Zeros")
plt.tight_layout()
plt.show()
These histograms show wide variation across cells, including some very low-count or high-zero samples that should be filtered out.
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
sns.violinplot(y=qc_ss_hcc['total_counts'], ax=axes[0])
axes[0].set_title("Total Counts per Cell")
sns.violinplot(y=qc_ss_hcc['n_genes'], ax=axes[1])
axes[1].set_title("Number of Genes per Cell")
sns.violinplot(y=qc_ss_hcc['pct_mito'], ax=axes[2])
axes[2].set_title("Mitochondrial Gene %")
plt.tight_layout()
plt.show()
Violin plots reveal that most cells fall within reasonable ranges, but tails indicate a few outliers.
adata = ad.AnnData(X=ss_hcc_raw.T)
adata.obs['total_counts'] = qc_ss_hcc['total_counts']
adata.obs['n_genes_by_counts'] = qc_ss_hcc['n_genes']
adata.obs['pct_counts_mt'] = qc_ss_hcc['pct_mito']
sc.pl.scatter(
adata,
x="total_counts",
y="n_genes_by_counts",
color="pct_counts_mt"
)
We apply thresholds (>2,000 genes, >100,000 reads, <10% mito) to keep only high-quality cells.
min_genes = 2_000    # Cells detecting fewer than 2,000 genes are filtered out
min_counts = 100_000 # Cells with extremely low total counts may be low-quality
max_mito = 10        # A common threshold is 5%-10% to flag high-mito cells
high_quality_cells = qc_ss_hcc[
(qc_ss_hcc['n_genes'] > min_genes) &
(qc_ss_hcc['total_counts'] > min_counts) &
(qc_ss_hcc['pct_mito'] < max_mito)
]
# Retain only the high-quality columns (cells)
ss_hcc_raw_filt = ss_hcc_raw[high_quality_cells.index]
print(f"Original: {ss_hcc_raw.shape[1]} cells")
print(f"Filtered: {ss_hcc_raw_filt.shape[1]} cells")
Original: 243 cells
Filtered: 233 cells
We filtered out 10 cells and now we inspect the plots again.
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
sns.violinplot(y=high_quality_cells['total_counts'], ax=axes[0])
axes[0].set_title("Total Counts per Cell")
sns.violinplot(y=high_quality_cells['n_genes'], ax=axes[1])
axes[1].set_title("Number of Genes per Cell")
sns.violinplot(y=high_quality_cells['pct_mito'], ax=axes[2])
axes[2].set_title("Mitochondrial Gene %")
plt.tight_layout()
plt.show()
After filtering, distributions tighten and outliers are removed, leaving a more homogeneous set of cells.
adata = ad.AnnData(X=ss_hcc_raw_filt.T)
adata.obs['total_counts'] = high_quality_cells['total_counts']
adata.obs['n_genes_by_counts'] = high_quality_cells['n_genes']
adata.obs['pct_counts_mt'] = high_quality_cells['pct_mito']
sc.pl.violin(
adata,
["total_counts", "n_genes_by_counts", "pct_counts_mt"],
jitter=0.4,
multi_panel=True
)
sc.pl.scatter(
adata,
x="total_counts",
y="n_genes_by_counts",
color="pct_counts_mt"
)
The violin and QC scatter plots confirm that filtering has been successful.
Duplicates¶
First we check for genes with all-zero expression profiles:
# Check original number of genes (rows)
original_rows = ss_hcc_raw_filt.shape[0]
# Drop genes with all-zero expression
ss_hcc_raw_filt = ss_hcc_raw_filt.loc[~(ss_hcc_raw_filt == 0).all(axis=1)]
# Check number of rows after dropping
remaining_rows = ss_hcc_raw_filt.shape[0]
# Compute how many were dropped
dropped_rows = original_rows - remaining_rows
print(f"Number of all-zero rows dropped: {dropped_rows}")
Number of all-zero rows dropped: 0
Smart-seq HCC1806 has no all-zero rows.
In this step, we scan the filtered HCC1806 matrix for any genes that have identical expression profiles across all cells — likely redundant entries from upstream processing — and then remove these duplicates. By reporting the row count before and after, we ensure our feature set contains only unique gene measurements.
duplicate_rows = ss_hcc_raw_filt[ss_hcc_raw_filt.duplicated(keep=False)]
print("number of duplicate rows: ", duplicate_rows.shape[0])
print("Rows before:", ss_hcc_raw_filt.shape[0])
ss_hcc_raw_filt = ss_hcc_raw_filt.drop_duplicates()
print("Rows after :", ss_hcc_raw_filt.shape[0])
number of duplicate rows:  92
Rows before: 23396
Rows after : 23339
Skewness & Kurtosis¶
# DATA TRANSFORMATION
from scipy.stats import skew, kurtosis

ss_hcc_raw_filt_log = np.log2(ss_hcc_raw_filt + 1) # genes × cells
ss_hcc_raw_filt_T = ss_hcc_raw_filt.T
ss_hcc_raw_filt_T_log = np.log2(ss_hcc_raw_filt_T + 1) # cells × genes
print("Before data transformation:", skew(ss_hcc_raw_filt.T.values.flatten()), kurtosis(ss_hcc_raw_filt.T.values.flatten()))
print("After data transformation:", skew(ss_hcc_raw_filt_T_log.values.flatten()), kurtosis(ss_hcc_raw_filt_T_log.values.flatten()))
Before data transformation: 66.87855806046676 10509.765195762657
After data transformation: 0.8280895490120331 -0.7178446753891188
After applying a log2 transformation to the expression matrix, the overall skewness (0.83) and kurtosis (−0.72) indicate that the distribution of gene expression values per cell is now approximately symmetric and less heavy-tailed. This transformation improves the suitability of the data for downstream analyses such as PCA and clustering.
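The effect is easy to reproduce on synthetic data. The sketch below draws heavy-tailed counts from a negative binomial (an assumed stand-in for real expression values, not the actual data) and shows the skewness dropping after log2(x + 1):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Heavy-tailed synthetic counts roughly mimicking raw expression values
raw = rng.negative_binomial(1, 0.02, size=10_000).astype(float)
logged = np.log2(raw + 1)

print(f"skew before: {skew(raw):.2f}, skew after: {skew(logged):.2f}")
```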
Train a linear classifier in PCA space¶
Here, we repeat the PCA plus logistic regression work for the HCC1806 dataset. First, we project our filtered, log2-transformed gene expressions into their top principal components — visualizing in 2D and 3D to see whether hypoxic and normoxic samples separate naturally. Then we fit a simple logistic-regression model on the 3D coordinates to define a linear decision plane that predicts each cell’s condition, and finally report its classification accuracy.
# 1. Transpose the DataFrame so that rows = samples, columns = genes
ss_hcc_raw_filt_T = ss_hcc_raw_filt.T # transpose necessary for pca.fit_transform()
# 2. Log-transform if data isn't already normalized
ss_hcc_raw_filt_T_log = np.log2(ss_hcc_raw_filt_T + 1)
# 3. Generate colors based on column/sample names
colors = ['royalblue' if 'hypo' in name.lower() else 'seagreen' for name in ss_hcc_raw_filt_T_log.index]
# 4. Run PCA with 2 components
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pc = pca.fit_transform(ss_hcc_raw_filt_T_log)
# 5. Plot
plt.figure(figsize=(8,6))
plt.scatter(pc[:,0], pc[:,1], c=colors)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.title("PCA of Samples (Colored by Condition)")
plt.grid(True)
plt.tight_layout()
plt.show()
The 2D PCA shows how much of the hypoxia vs. normoxia signal is captured by PC1 and PC2.
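Before settling on two or three components, it is worth checking how quickly the explained variance accumulates. A minimal sketch on random data (a stand-in for `ss_hcc_raw_filt_T_log`, cells × genes):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random cells x genes matrix standing in for the real expression data
X = rng.normal(size=(100, 50))

pca = PCA(n_components=10)
pca.fit(X)

# Cumulative variance captured by the first k components
cumvar = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cumvar, 3))
```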
# Run PCA with 3 components
pca = PCA(n_components=3)
pc = pca.fit_transform(ss_hcc_raw_filt_T_log)
pio.renderers.default = 'browser'
# Extract labels from sample names
labels = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in ss_hcc_raw_filt_T_log.index]
# Get indices for each condition
hypo_idx = np.array(labels) == 'Hypo'
norm_idx = np.array(labels) == 'Norm'
fig = go.Figure()
# Hypo samples
fig.add_trace(go.Scatter3d(
x=pc[hypo_idx, 0],
y=pc[hypo_idx, 1],
z=pc[hypo_idx, 2],
mode='markers',
name='Hypo',
marker=dict(color='royalblue', size=6),
text=ss_hcc_raw_filt_T_log.index[hypo_idx],
hoverinfo='text'
))
# Norm samples
fig.add_trace(go.Scatter3d(
x=pc[norm_idx, 0],
y=pc[norm_idx, 1],
z=pc[norm_idx, 2],
mode='markers',
name='Norm',
marker=dict(color='seagreen', size=6),
text=ss_hcc_raw_filt_T_log.index[norm_idx],
hoverinfo='text'
))
# Set layout
fig.update_layout(
scene=dict(
xaxis_title=f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)',
yaxis_title=f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)',
zaxis_title=f'PC3 ({pca.explained_variance_ratio_[2]*100:.1f}%)'
),
title='3D PCA of Samples (Interactive)',
margin=dict(l=0, r=0, b=0, t=40)
)
fig.show()
This is an interactive plot that opens in the browser. In 3D space, the two conditions form more distinct clusters, suggesting good separability. Next, we'll train a logistic-regression classifier on the three PCA axes to find the best separating plane.
labels = [1 if 'hypo' in name.lower() else 0 for name in ss_hcc_raw_filt_T_log.index]
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(pc, labels)
# Extract coefficients (normal vector to the plane)
w = clf.coef_[0] # [w1, w2, w3]
b = clf.intercept_[0]
# Create grid to cover the PCA space
x_range = np.linspace(pc[:, 0].min(), pc[:, 0].max(), 10)
y_range = np.linspace(pc[:, 1].min(), pc[:, 1].max(), 10)
xx, yy = np.meshgrid(x_range, y_range)
# Compute corresponding z for the plane
zz = (-w[0] * xx - w[1] * yy - b) / w[2]
# Compute decision values for all points
decision_values = np.dot(pc, w) + b
# Predicted labels: 1 if value > 0 (Hypo), else 0 (Norm)
predicted_labels = (decision_values > 0).astype(int)
The weights w and intercept b define our decision boundary; samples with positive decision values are classified as hypoxic.
labels = np.array([1 if 'hypo' in name.lower() else 0 for name in ss_hcc_raw_filt_T_log.index])
accuracy = (predicted_labels == labels).mean()
print(f"Accuracy of plane: {accuracy * 100:.2f}%")
Accuracy of plane: 90.13%
The classifier achieves ~90.13% accuracy, indicating that the first three principal components retain sufficient information to distinguish hypoxia from normoxia. This accuracy, however, is lower than for MCF7, suggesting that the Smart-seq HCC1806 data exhibit more subtle expression differences between conditions, making separation more challenging for linear models.
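Note that the accuracy above is computed on the same samples the classifier was fit on, so it is an optimistic estimate. A cross-validated sketch on synthetic data (two shifted Gaussian clouds as an assumed stand-in for hypoxic vs. normoxic cells) shows the pattern; fitting PCA inside the pipeline keeps each fold's projection independent of its held-out cells:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Two shifted Gaussian clouds standing in for the two conditions
X = np.vstack([rng.normal(0.0, 1.0, (60, 50)),
               rng.normal(1.0, 1.0, (60, 50))])
y = np.array([1] * 60 + [0] * 60)

# PCA is refit inside each CV fold, avoiding leakage into the test cells
model = make_pipeline(PCA(n_components=3), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```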
Differences between "...unfiltered...txt", "...filtered...txt", and "...normalized...txt" data¶
Data Overview¶
ss_mcf7_filt = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_Filtered_Data.txt",delimiter=" ",engine='python',index_col=0)
ss_mcf7_norm = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter=" ",engine='python',index_col=0)
ss_hcc_filt = pd.read_csv("AILab2025/SmartSeq/HCC1806_SmartS_Filtered_Data.txt",delimiter=" ",engine='python',index_col=0)
ss_hcc_norm = pd.read_csv("AILab2025/SmartSeq/HCC1806_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter=" ",engine='python',index_col=0)
Here we load the filtered data (genes × cells) and the normalized training sets (top 3,000 genes × cells) for both cell lines.
ss_mcf7_filt.describe()
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam | output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam | output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam | output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam | output.STAR.1_B11_Hypo_S77_Aligned.sortedByCoord.out.bam | output.STAR.1_B12_Hypo_S78_Aligned.sortedByCoord.out.bam | output.STAR.1_B4_Norm_S52_Aligned.sortedByCoord.out.bam | ... | output.STAR.4_H10_Hypo_S382_Aligned.sortedByCoord.out.bam | output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam | output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam | output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam | output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam | output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | ... | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 | 18945.000000 |
| mean | 49.409290 | 104.620639 | 17.589707 | 68.045395 | 91.260333 | 75.979784 | 81.576194 | 85.303985 | 49.655529 | 16.382792 | ... | 54.711375 | 21.016785 | 50.920137 | 39.622486 | 26.620164 | 21.099023 | 59.585537 | 74.487305 | 82.655054 | 76.081499 |
| std | 511.986757 | 1139.662971 | 136.014975 | 553.362211 | 472.099720 | 571.441098 | 504.632248 | 911.153373 | 406.561440 | 160.981562 | ... | 633.970615 | 212.338278 | 281.722199 | 329.984580 | 168.460343 | 217.871697 | 394.584632 | 594.260858 | 699.898130 | 863.857880 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 4.000000 | 6.000000 | 4.000000 | 1.000000 | 3.000000 | 0.000000 | ... | 0.000000 | 1.000000 | 4.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| 75% | 28.000000 | 27.000000 | 10.000000 | 36.000000 | 57.000000 | 52.000000 | 55.000000 | 48.000000 | 33.000000 | 9.000000 | ... | 28.000000 | 13.000000 | 42.000000 | 25.000000 | 17.000000 | 13.000000 | 42.000000 | 45.000000 | 55.000000 | 48.000000 |
| max | 46744.000000 | 82047.000000 | 10582.000000 | 46856.000000 | 29534.000000 | 50972.000000 | 36236.000000 | 56068.000000 | 24994.000000 | 13587.000000 | ... | 49147.000000 | 17800.000000 | 23355.000000 | 29540.000000 | 12149.000000 | 19285.000000 | 28021.000000 | 40708.000000 | 46261.000000 | 68790.000000 |
8 rows × 313 columns
ss_hcc_filt.describe()
| output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate1A9_Normoxia_S20_Aligned.sortedByCoord.out.bam | ... | output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam | output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | ... | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 | 19503.000000 |
| mean | 119.427883 | 249.107522 | 180.739527 | 42.818233 | 56.444393 | 183.264677 | 162.976670 | 46.014305 | 54.589961 | 96.803210 | ... | 91.583090 | 126.622930 | 64.801005 | 35.702302 | 34.670461 | 125.629544 | 42.195457 | 129.769010 | 44.715941 | 91.517561 |
| std | 577.934133 | 1069.768525 | 1067.470509 | 224.823960 | 596.882811 | 944.432350 | 951.367277 | 289.708746 | 401.024242 | 487.943421 | ... | 377.847391 | 585.760835 | 375.921207 | 203.991666 | 147.706909 | 484.448028 | 186.359651 | 643.033801 | 197.842998 | 402.529704 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 3.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 4.000000 | 0.000000 | 0.000000 | 9.000000 | ... | 9.000000 | 4.000000 | 1.000000 | 2.000000 | 1.000000 | 15.000000 | 3.000000 | 4.000000 | 2.000000 | 8.000000 |
| 75% | 75.000000 | 179.000000 | 111.000000 | 31.000000 | 29.000000 | 126.000000 | 106.000000 | 32.000000 | 29.000000 | 78.000000 | ... | 77.000000 | 94.000000 | 42.000000 | 25.000000 | 27.000000 | 105.000000 | 34.000000 | 94.000000 | 32.000000 | 63.000000 |
| max | 35477.000000 | 69068.000000 | 70206.000000 | 17326.000000 | 47442.000000 | 43081.000000 | 62813.000000 | 30240.000000 | 35450.000000 | 42310.000000 | ... | 19629.000000 | 30987.000000 | 21894.000000 | 13457.000000 | 11488.000000 | 33462.000000 | 15403.000000 | 34478.000000 | 10921.000000 | 28532.000000 |
8 rows × 227 columns
ss_mcf7_norm.describe()
| output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam | output.STAR.2_B4_Norm_S58_Aligned.sortedByCoord.out.bam | output.STAR.2_B5_Norm_S59_Aligned.sortedByCoord.out.bam | output.STAR.2_B6_Norm_S60_Aligned.sortedByCoord.out.bam | output.STAR.2_B7_Hypo_S79_Aligned.sortedByCoord.out.bam | output.STAR.2_B9_Hypo_S81_Aligned.sortedByCoord.out.bam | output.STAR.2_C10_Hypo_S130_Aligned.sortedByCoord.out.bam | output.STAR.2_C11_Hypo_S131_Aligned.sortedByCoord.out.bam | output.STAR.2_C1_Norm_S103_Aligned.sortedByCoord.out.bam | output.STAR.2_C2_Norm_S104_Aligned.sortedByCoord.out.bam | ... | output.STAR.4_H10_Hypo_S382_Aligned.sortedByCoord.out.bam | output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam | output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam | output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam | output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam | output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | ... | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 |
| mean | 74.140333 | 90.907000 | 99.089000 | 88.137000 | 110.395667 | 148.849000 | 126.422667 | 142.229667 | 91.781000 | 91.426333 | ... | 144.008333 | 133.846000 | 98.699333 | 84.070333 | 101.416333 | 96.636667 | 92.344333 | 154.387333 | 125.340000 | 132.017667 |
| std | 345.005307 | 409.560228 | 442.980702 | 425.804372 | 822.178446 | 1710.088769 | 1351.567001 | 1515.496440 | 388.660906 | 376.793214 | ... | 1349.125183 | 1242.320764 | 417.410827 | 406.100983 | 513.988262 | 499.224863 | 680.698856 | 1169.686762 | 1066.926126 | 1422.143351 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 24.000000 | 37.000000 | 33.000000 | 34.000000 | 38.250000 | 24.000000 | 13.000000 | 22.000000 | 37.000000 | 44.000000 | ... | 33.000000 | 38.000000 | 52.250000 | 25.000000 | 33.000000 | 44.000000 | 17.000000 | 19.000000 | 21.000000 | 20.250000 |
| max | 8222.000000 | 10167.000000 | 11446.000000 | 10312.000000 | 30586.000000 | 65037.000000 | 52680.000000 | 60789.000000 | 9394.000000 | 9077.000000 | ... | 56392.000000 | 50404.000000 | 11352.000000 | 8713.000000 | 17006.000000 | 16625.000000 | 29663.000000 | 34565.000000 | 34175.000000 | 57814.000000 |
8 rows × 250 columns
ss_mcf7_norm.head(5)
| output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam | output.STAR.2_B4_Norm_S58_Aligned.sortedByCoord.out.bam | output.STAR.2_B5_Norm_S59_Aligned.sortedByCoord.out.bam | output.STAR.2_B6_Norm_S60_Aligned.sortedByCoord.out.bam | output.STAR.2_B7_Hypo_S79_Aligned.sortedByCoord.out.bam | output.STAR.2_B9_Hypo_S81_Aligned.sortedByCoord.out.bam | output.STAR.2_C10_Hypo_S130_Aligned.sortedByCoord.out.bam | output.STAR.2_C11_Hypo_S131_Aligned.sortedByCoord.out.bam | output.STAR.2_C1_Norm_S103_Aligned.sortedByCoord.out.bam | output.STAR.2_C2_Norm_S104_Aligned.sortedByCoord.out.bam | ... | output.STAR.4_H10_Hypo_S382_Aligned.sortedByCoord.out.bam | output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam | output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam | output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam | output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam | output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CYP1B1 | 343 | 131 | 452 | 27 | 5817 | 3841 | 9263 | 21543 | 1013 | 53 | ... | 7890 | 4512 | 160 | 351 | 327 | 196 | 504 | 34565 | 20024 | 5953 |
| CYP1B1-AS1 | 140 | 59 | 203 | 7 | 2669 | 1565 | 3866 | 9113 | 459 | 22 | ... | 3647 | 2035 | 75 | 138 | 130 | 102 | 238 | 13717 | 7835 | 2367 |
| CYP1A1 | 0 | 0 | 0 | 0 | 0 | 79 | 238 | 443 | 0 | 0 | ... | 86 | 1654 | 0 | 0 | 0 | 1 | 0 | 11274 | 563 | 522 |
| NDRG1 | 0 | 1 | 0 | 0 | 654 | 1263 | 2634 | 540 | 0 | 13 | ... | 481 | 1052 | 0 | 0 | 54 | 243 | 62 | 1263 | 925 | 1572 |
| DDIT4 | 386 | 289 | 0 | 288 | 2484 | 2596 | 1323 | 2044 | 36 | 204 | ... | 3692 | 2410 | 800 | 1 | 189 | 266 | 417 | 4256 | 12733 | 2275 |
5 rows × 250 columns
Looking at mean values, we immediately see that the normalized data have not been log-transformed. We will explore later whether further scaling or transformations are needed before training ML models on this data.
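One quick diagnostic is whether per-cell totals are constant, as they would be under library-size (e.g. CPM-style) normalization. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy genes x cells count matrix standing in for ss_mcf7_norm
df = pd.DataFrame(rng.poisson(5, size=(20, 4)).astype(float))

# Spread of per-cell totals: large spread means counts are not rescaled
totals = df.sum(axis=0)
print("CV of per-cell totals:", round(totals.std() / totals.mean(), 3))

# After CPM-style rescaling, every column sums to exactly 1e6
cpm = df / totals * 1_000_000
print(cpm.sum(axis=0).round(2).tolist())
```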
print("MCF7 raw:", ss_mcf7_raw.shape)
print("Filtered shape:", ss_mcf7_filt.shape)
print("Normalised shape:", ss_mcf7_norm.shape)
print("\nHCC1806 raw:", ss_hcc_raw.shape)
print("Filtered shape:", ss_hcc_filt.shape)
print("Normalised shape:", ss_hcc_norm.shape)
MCF7 raw: (22934, 383)
Filtered shape: (18945, 313)
Normalised shape: (3000, 250)

HCC1806 raw: (23396, 243)
Filtered shape: (19503, 227)
Normalised shape: (3000, 182)
Comparing shapes: the raw matrices contain all genes and cells, filtering removes low-quality genes and cells, and the normalized training sets retain only the top 3,000 genes (and a training subset of cells).
SmartSeq MCF7 Raw vs Filtered¶
In this section, we compare our own simple filtering rules to the provided Smart-Seq filtered dataset for MCF7. We’ll look at two aspects:
- Gene Filtering – identify how many genes our “expressed in >5 cells” rule removes versus the official filter, inspect the dropped genes’ expression and dispersion statistics, and check why they might have been excluded
- Cell Filtering – count how many cells we’d remove based on simple QC thresholds (total counts >250 k & genes >5 k), compare that to the provided filtered set, and see where any discrepancies lie
Gene Filtering¶
ss_mcf7_raw_filt.shape
(22843, 337)
genes_raw = set(ss_mcf7_raw.index)
genes_filtered = set(ss_mcf7_filt.index)
dropped_genes = genes_raw - genes_filtered
print(f"Genes dropped: {len(dropped_genes)}")
Genes dropped: 3989
Our filter removed significantly fewer genes (22934-22843=91) than the version provided to us (22934-18945=3989), so there must be additional criteria in the pipeline.
# Check expression stats for dropped genes
dropped_stats = ss_mcf7_raw.loc[list(dropped_genes)].sum(axis=1).describe()
print(dropped_stats)
count 3989.000000 mean 19.475307 std 43.194078 min 2.000000 25% 4.000000 50% 8.000000 75% 20.000000 max 1530.000000 dtype: float64
The dropped genes still show moderate total counts, so they aren’t all low-expression “noise”. Let's see what other filters could have been applied.
# Genes that are expressed in more than 5 cells
ss_mcf7_raw_genes_mask = (ss_mcf7_raw > 0).sum(axis=1) > 5 # a higher threshold leaves fewer than 18945 genes remaining
ss_mcf7_raw_gene_set = set(ss_mcf7_raw.index[ss_mcf7_raw_genes_mask])
print(f"Genes passing our threshold: {len(ss_mcf7_raw_gene_set)}")
Genes passing our threshold: 19182
Only genes that are expressed in more than 5 cells were retained in ss_mcf7_filt:
ss_mcf7_raw_gene_set = ss_mcf7_raw.index[ss_mcf7_raw_genes_mask]
ss_mcf7_filt_gene_set = ss_mcf7_filt.index
overlap = ss_mcf7_raw_gene_set.intersection(ss_mcf7_filt_gene_set)
print(f"Overlap: {len(overlap)} / {len(ss_mcf7_filt_gene_set)}")
Overlap: 18945 / 18945
Our 19182 genes contain all 18945 filtered genes, but we still need to investigate why the remaining 237 genes were discarded.
# Convert Indexes to sets
ss_mcf7_raw_gene_set = set(ss_mcf7_raw_gene_set)
ss_mcf7_filt_gene_set = set(ss_mcf7_filt_gene_set)
# Find extra genes (present in the threshold but not in the filtered set)
extra_genes = ss_mcf7_raw_gene_set - ss_mcf7_filt_gene_set
extra_genes_list = list(extra_genes)
ss_mcf7_raw.loc[extra_genes_list].mean(axis=1).describe()
count 237.000000 mean 0.125183 std 0.160211 min 0.015666 25% 0.033943 50% 0.078329 75% 0.156658 max 1.360313 dtype: float64
# Expression counts for the remaining 237 genes
ss_mcf7_raw.loc[extra_genes_list].sum(axis=1).describe()
count 237.000000 mean 47.945148 std 61.360780 min 6.000000 25% 13.000000 50% 30.000000 75% 60.000000 max 521.000000 dtype: float64
subset = ss_mcf7_raw.loc[extra_genes_list]
gene_means = subset.mean(axis=1)
gene_vars = subset.var(axis=1)
gene_dispersion = gene_vars / gene_means.replace(0, np.nan)
print("\nVariance:")
print(gene_vars.describe())
print("\nDispersion (var / mean):")
print(gene_dispersion.describe())
Variance: count 237.000000 mean 6.859440 std 35.000200 min 0.015461 25% 0.105792 50% 0.637089 75% 2.725698 max 432.822714 dtype: float64 Dispersion (var / mean): count 237.000000 mean 17.180863 std 31.133525 min 0.984293 25% 2.979058 50% 8.507272 75% 18.379411 max 318.178694 dtype: float64
These genes are not low-variance “noise” genes:
- Dispersion values are well above common filtering thresholds (e.g., 0.5–1.0)
- Variance spans a wide range, including quite high (up to 432)
Next: Are they being dropped based on their identity (e.g. pseudogenes, mitochondrial, ribosomal)?
extra_genes_series = pd.Series(extra_genes_list)
# Check for known non-informative categories
is_mito = extra_genes_series.str.startswith("MT-")
is_ribo = extra_genes_series.str.startswith("RPL") | extra_genes_series.str.startswith("RPS")
is_pseudo = extra_genes_series.str.contains("-P")
is_mirna = extra_genes_series.str.contains("MIR")
is_linc = extra_genes_series.str.contains("LINC")
print(f"Mitochondrial: {is_mito.sum()}")
print(f"Ribosomal: {is_ribo.sum()}")
print(f"Pseudogenes (-P): {is_pseudo.sum()}")
print(f"miRNAs (MIR): {is_mirna.sum()}")
print(f"LINC: {is_linc.sum()}")
Mitochondrial: 0 Ribosomal: 13 Pseudogenes (-P): 0 miRNAs (MIR): 8 LINC: 13
Only ~14% fall into obvious categories (MT, RPL/RPS, etc.), so most dropped genes remain unexplained by standard filters.
print(extra_genes_list)
['SLC9C2', 'FEZF1-AS1', 'CALD1', 'KLRG1', 'ASAP1-IT2', 'GOLGA8R', 'LINC02137', 'MIR3143', 'DDR1-DT', 'LINC00661', 'PARP4P2', 'KCNC1', 'LINC00964', 'RPL12P33', 'CYP2F1', 'SLC34A3', 'TRAV27', 'SPOCK2', 'HSPB6', 'SNAP91', 'AMDHD1', 'FST', 'HMGN2P23', 'OR8B7P', 'HNRNPA1P35', 'MRRFP1', 'FNDC8', 'LINC02035', 'TNNC1', 'ARL9', 'ARHGEF18-AS1', 'SNTG1', 'XAGE2', 'SH3TC1', 'TAS2R43', 'TRPM3', 'LRRC43', 'P3H3', 'TLR9', 'IQGAP2', 'PRB2', 'STAB2', 'MAPK11', 'RPL7P36', 'ADM-DT', 'SOHLH2', 'QRICH2', 'C17orf50', 'TUSC8', 'LRIG3-DT', 'CELF3', 'RP1L1', 'IPCEF1', 'A2ML1', 'MSX2P1', 'HMGA1P3', 'PRSS30P', 'CD70', 'LINC00310', 'ARAP3', 'DDX10P2', 'MYO1A', 'AJAP1', 'RNA5SP18', 'MIR7161', 'SNORD3B-2', 'NDUFA3P1', 'CUX2', 'PKD1P2', 'HSPE1P6', 'HNRNPA3P9', 'RPL5P5', 'HMX3', 'CPB2-AS1', 'MXRA8', 'CELF4', 'MCTP2', 'RNA5SP392', 'PDPN', 'RPL23AP73', 'HSPE1P7', 'PPIAP51', 'SLC7A9', 'SNORA68B', 'CA5A', 'DNAJC8P1', 'PPARGC1A', 'OR2A7', 'LINC02169', 'TYRO3P', 'RPL32P2', 'HLA-S', 'RNA5SP440', 'HTR2B', 'AMPH', 'RALBP1P1', 'MIR4449', 'PSAT1', 'RNA5SP477', 'TRMT1P1', 'CCBE1', 'TAL2', 'UBE2CP2', 'LRAT', 'KLF1', 'OR7E7P', 'OR8B5P', 'MAMDC2-AS1', 'WIPF3', 'ABHD12B', 'GPR150', 'CHRNG', 'SLC7A2-IT1', 'MIR3188', 'ITGA1', 'MIR503', 'CCN3', 'LINC02895', 'LINC00572', 'RN7SL688P', 'HHIP', 'RPL23AP10', 'IGLV1-51', 'VAV1', 'SYT2', 'MIR2861', 'ADIPOQ', 'KRT37', 'CACNA1A', 'RNVU1-21', 'BACH1-AS1', 'KIF26B', 'RNU1-1', 'AQP8', 'RPS7P11', 'HOXB-AS1', 'C3orf35', 'ALOX15P1', 'NUTM2F', 'PLAT', 'IGF2-AS', 'NAPSB', 'TEX38', 'RPL7P52', 'LRRC15', 'ARG1', 'TSPEAR-AS2', 'FER1L6', 'PTMAP15', 'SEC14L5', 'TMEM47', 'LINC01224', 'WWC2-AS2', 'RRAD', 'DHX58', 'RPL21P34', 'NR1I2', 'LINC01863', 'PNMA8B', 'KLHL6', 'LCN2', 'DNAH12', 'CHMP4BP1', 'PIWIL2', 'DDC', 'TCF7L1-IT1', 'COLGALT2', 'HLA-DQA2', 'NOVA2', 'CPLX3', 'PCDHB5', 'CABP1', 'NRXN3', 'RBBP4P4', 'TRIM72', 'VTRNA1-1', 'ENPP3', 'SNORD98', 'MIR4271', 'TIGD4', 'OR4F21', 'HMGB1P26', 'EYS', 'VWA2', 'ITIH4', 'BMS1P3', 'FAM71F2', 'LINC02009', 'GAPDHP39', 'CES3', 'ANKLE1', 'PADI1', 
'MRPL23-AS1', 'PEG10', 'THSD4-AS1', 'SHD', 'FTH1P1', 'PTCHD4', 'RN7SL605P', 'BATF2', 'AOC1', 'ABCC13', 'TSPEAR', 'MIR3175', 'TRIM17', 'RBP5', 'ELMOD1', 'LINC02014', 'BANK1', 'RPL26P6', 'RPL8P1', 'TMEM88', 'LARP1P1', 'H3P42', 'RPL4P2', 'TIMP4', 'FER1L6-AS2', 'FAM135A', 'ITGB7', 'RPL21P131', 'EPSTI1', 'IGLL3P', 'DPP3P1', 'GZMM', 'RN7SKP36', 'VILL', 'LINC01456', 'SPATA46', 'ANGPT4', 'S100A7', 'A2M', 'SNCAIP', 'DAND5', 'NFYBP1', 'EPO', 'SERPINA4', 'GOSR2-DT']
Let us see whether the removed genes are duplicates.
duplicate_rows = ss_mcf7_raw[ss_mcf7_raw.duplicated(keep=False)]
print("number of duplicate rows: ", duplicate_rows.shape[0])
# Convert to sets
duplicate_gene_set = set(duplicate_rows.index)
extra_removed_set = set(extra_genes_list)
# Intersect to find which of the 237 removed genes were duplicates
dup_overlap = extra_removed_set.intersection(duplicate_gene_set)
print(f"Removed genes that were also duplicates: {len(dup_overlap)}")
print("Example overlapping genes:", list(dup_overlap)[:10])
number of duplicate rows: 56 Removed genes that were also duplicates: 0 Example overlapping genes: []
The removed genes are not duplicates. The remaining ~200 are likely filtered by manual curation or a custom blacklist.
Cell Filtering¶
dropped_cells = ss_mcf7_raw.shape[1] - ss_mcf7_filt.shape[1]
print(f"Number of removed cells: {dropped_cells}")
Number of removed cells: 70
Now we examine some properties of the dropped cells to identify why they were discarded.
qc_cells = pd.DataFrame({
"total_counts": ss_mcf7_raw.sum(axis=0),
"n_genes": (ss_mcf7_raw > 0).sum(axis=0)
})
retained_cells = ss_mcf7_filt.columns
dropped_cells = ss_mcf7_raw.columns.difference(retained_cells)
qc_retained = qc_cells.loc[retained_cells]
qc_dropped = qc_cells.loc[dropped_cells]
print("Retained cells:")
print(qc_retained.describe())
print("\nDropped cells:")
print(qc_dropped.describe())
Retained cells:
total_counts n_genes
count 3.130000e+02 313.000000
mean 1.158035e+06 10046.894569
std 3.964047e+05 1258.586404
min 2.633690e+05 5358.000000
25% 8.801830e+05 9326.000000
50% 1.199119e+06 10242.000000
75% 1.460597e+06 10922.000000
max 1.982470e+06 12260.000000
Dropped cells:
total_counts n_genes
count 7.000000e+01 70.000000
mean 2.638760e+05 4998.542857
std 5.509890e+05 3444.854204
min 1.000000e+00 1.000000
25% 6.060000e+03 2196.250000
50% 7.184050e+04 5204.000000
75% 1.470945e+05 8015.750000
max 2.308057e+06 12519.000000
Dropped cells have mean total counts roughly an order of magnitude lower than retained cells, and detect about half as many genes on average (~5,000 vs ~10,000).
# Apply a candidate filter
# We only keep the cells that have more than 250,000 total counts and more than 5,000 detected genes
cell_mask = (qc_cells['total_counts'] > 250_000) & (qc_cells['n_genes'] > 5000)
filtered_candidate = set(qc_cells.index[cell_mask])
# Cells in the original filtered dataset
original_filtered = set(ss_mcf7_filt.columns)
# Overlap
overlap = filtered_candidate.intersection(original_filtered)
print(f"Candidate filter retains: {len(filtered_candidate)} cells")
print(f"Overlap with ss_mcf7_filt: {len(overlap)} / {len(original_filtered)} ({len(overlap)/len(original_filtered)*100:.1f}%)")
Candidate filter retains: 320 cells Overlap with ss_mcf7_filt: 313 / 313 (100.0%)
Experimenting with multiple total-count and gene-count thresholds, we find that >250,000 counts and >5,000 genes give the best threshold-based filtering rules. Our candidate filter retains all 313 cells of ss_mcf7_filt plus 7 extra cells; those 7 were likely removed manually from the official set.
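The threshold experimentation above can be sketched as a small grid scan that scores each cutoff pair by how many extra cells it keeps while still covering the official filtered set. This is a toy version with a hypothetical `qc_cells` table and `official` set; the notebook would use the `qc_cells` frame and `set(ss_mcf7_filt.columns)` computed earlier:

```python
import pandas as pd

# Hypothetical QC table standing in for the real per-cell metrics
qc_cells = pd.DataFrame({
    "total_counts": [1_200_000, 300_000, 70_000, 900_000],
    "n_genes": [10_000, 6_000, 2_000, 9_500],
}, index=["c1", "c2", "c3", "c4"])
official = {"c1", "c2", "c4"}  # stand-in for set(ss_mcf7_filt.columns)

best = None
for tc in [100_000, 250_000, 500_000]:   # candidate total-count cutoffs
    for ng in [3_000, 5_000, 7_000]:     # candidate gene-count cutoffs
        kept = set(qc_cells.index[(qc_cells["total_counts"] > tc) &
                                  (qc_cells["n_genes"] > ng)])
        # Score: keep full recall of the official set, minimise extra cells
        if official <= kept:
            score = len(kept - official)
            if best is None or score < best[0]:
                best = (score, tc, ng)
print(best)  # (extra cells kept, total-count cutoff, gene-count cutoff)
```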
SmartSeq MCF7 Filtered vs Normalised + Filtered¶
In this section, we compare the filtered Smart-Seq matrix (all high-quality genes and cells) against the final normalized version that retains only 3,000 genes. We’ll walk through:
- exploration: How many genes and cells are lost during normalization, and how do per-cell totals and detection rates change?
- variance analysis: How does log-transform and normalization affect gene-wise variance and the choice of the top 3,000 genes?
- normalization methods: Reconstruct the normalization steps (e.g. total-count scaling to 1e6 or median library size) to pinpoint which approach matches the provided data.
- cell dropout: Investigate why 63 cells vanish post-normalization; do their QC metrics or expression sparsity explain the removal?
- gene dropout: Examine the 15,945 genes dropped; does variance or dispersion alone predict their exclusion, or is a more sophisticated highly variable gene selection algorithm (e.g. Scanpy’s Seurat flavor) required?
Together, these analyses reveal exactly how filtering and normalization reshape our data, and why certain cells or genes are retained or discarded in the final training set.
Exploration¶
# Genes in filtered but not in normalised
dropped_genes = ss_mcf7_filt.index.difference(ss_mcf7_norm.index)
print(f"Genes dropped during normalization: {len(dropped_genes)}")
Genes dropped during normalization: 15945
dropped_cells = ss_mcf7_filt.columns.difference(ss_mcf7_norm.columns)
print(f"Cells dropped during normalization: {len(dropped_cells)}")
Cells dropped during normalization: 63
A substantial number of genes (15,945) and a smaller set of cells (63) are removed when we go from filtered to normalized. Let’s see how that impacts counts and detection rates.
# Total counts per cell = sum of all gene expression values
total_counts_before = ss_mcf7_filt.sum(axis=0)
# Number of expressed genes (non-zero) per cell
n_genes_before = (ss_mcf7_filt > 0).sum(axis=0)
total_counts_after = ss_mcf7_norm.sum(axis=0)
n_genes_after = (ss_mcf7_norm > 0).sum(axis=0)
print(f"Average total counts (before): {total_counts_before.mean():.2f}")
print(f"Average total counts (after): {total_counts_after.mean():.2f}\n")
print(f"Average n_genes per cell (before): {n_genes_before.mean():.2f}")
print(f"Average n_genes per cell (after): {n_genes_after.mean():.2f}")
Average total counts (before): 1157815.77 Average total counts (after): 347700.15 Average n_genes per cell (before): 10011.00 Average n_genes per cell (after): 1091.34
Average total counts per cell drop roughly threefold, while the average number of detected genes per cell drops about tenfold, since only 3,000 genes are retained.
# Combine total counts
ss_mcf7_total_counts = pd.DataFrame({
'total_counts': pd.concat([total_counts_before, total_counts_after]),
'stage': ['Before'] * len(total_counts_before) + ['After'] * len(total_counts_after)
})
# Combine gene counts
ss_mcf7_n_genes = pd.DataFrame({
'n_genes': pd.concat([n_genes_before, n_genes_after]),
'stage': ['Before'] * len(n_genes_before) + ['After'] * len(n_genes_after)
})
plt.figure(figsize=(12, 5))
# Violin plot for total counts
plt.subplot(1, 2, 1)
sns.violinplot(data=ss_mcf7_total_counts, x='stage', y='total_counts', hue='stage', palette='Set2', legend=False)
plt.title("Total Counts per Cell")
plt.xlabel("")
# Violin plot for # of genes
plt.subplot(1, 2, 2)
sns.violinplot(data=ss_mcf7_n_genes, x='stage', y='n_genes', hue='stage', palette='Set2', legend=False)
plt.title("Number of Genes per Cell")
plt.xlabel("")
plt.suptitle("Before vs After Normalization", fontsize=14)
plt.tight_layout()
plt.show()
The “After” violins are noticeably narrower, indicating that cells have been rescaled to a common library size, thereby reducing variability in total counts and gene detection across cells.
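The narrowing can also be quantified with the coefficient of variation (CV = std / mean) of per-cell totals, which should shrink after rescaling. A sketch with toy totals (the notebook would plug in `total_counts_before` and `total_counts_after`):

```python
import numpy as np

# Toy per-cell totals; stand-ins for total_counts_before / total_counts_after
before = np.array([263_000, 880_000, 1_200_000, 1_980_000], dtype=float)
after = np.array([340_000, 348_000, 350_000, 352_000], dtype=float)

# Coefficient of variation: spread relative to the mean
cv = lambda x: x.std() / x.mean()
print(f"CV before: {cv(before):.2f}, CV after: {cv(after):.3f}")
```

A lower CV after normalization confirms that cells were rescaled toward a common library size.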
# Mean per-gene variance of log-transformed expression across cells
print("Filtered + log2 mean variance:", np.log2(ss_mcf7_filt + 1).var(axis=1).mean())
print("Normalised + log2 mean variance:", np.log2(ss_mcf7_norm + 1).var(axis=1).mean())
Filtered + log2 mean variance: 2.7114283938366706 Normalised + log2 mean variance: 3.9135395503089314
The normalised data has a higher mean variance per gene under log-transform. Let us examine this further:
ss_mcf7_filt_log = np.log2(ss_mcf7_filt + 1)
ss_mcf7_norm_log = np.log2(ss_mcf7_norm + 1)
filt_gene_var_log = ss_mcf7_filt_log.var(axis=1)
norm_gene_var_log = ss_mcf7_norm_log.var(axis=1)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Gene variance in filtered data (log2)
sns.histplot(filt_gene_var_log, bins=100, color='red', stat='density', ax=axes[0])
axes[0].set_title("Log2 Gene Variance in Filtered Data")
axes[0].set_xlabel("Variance")
# Gene variance in normalised data (log2)
sns.histplot(norm_gene_var_log, bins=100, color='blue', stat='density', ax=axes[1])
axes[1].set_title("Log2 Gene Variance in Normalised Data")
axes[1].set_xlabel("Variance")
plt.tight_layout()
plt.show()
The left plot shows the distribution of gene variance (log2-transformed) in the filtered data. Most genes exhibit very low variance, with only a few showing substantial variability. After normalization (right plot), the distribution becomes broader, with more genes showing moderate-to-high variance (histogram shifts to the right). This indicates improved dynamic range and suggests that normalization helped preserve biologically variable genes while reducing the influence of low-variance or uninformative ones.
# Calculate variance per gene in filtered data
gene_var = ss_mcf7_filt.var(axis=1)
# Get top 3000 highly variable genes
top_var_genes = gene_var.sort_values(ascending=False).head(3000)
# Compare to normalized gene set
kept_genes = ss_mcf7_norm.index
# How many retained genes overlap with the top variable genes?
overlap = kept_genes.intersection(top_var_genes.index)
print(f"{len(overlap)} of {len(kept_genes)} genes in normalized data are among the top 3000 variable genes.")
894 of 3000 genes in normalized data are among the top 3000 variable genes.
It appears that a more advanced technique was used to pinpoint the top 3000 variable genes. We explore this further in section 'Why 15945 Genes Were Dropped?'.
What normalisation could have been applied?¶
Let's try rescaling the filtered data so that each cell has a total of 1M counts.
# Scanpy equivalent: sc.pp.normalize_total(adata, target_sum=1e6), giving each cell a total of 1,000,000 counts
ss_mcf7_norm_like = ss_mcf7_filt.div(ss_mcf7_filt.sum(axis=0), axis=1) * 1e6 # manual reimplementation of Scanpy’s normalize_total
total_counts_norm_like = ss_mcf7_norm_like.sum(axis=0)
print("Total counts after rescaling to 1e6:", total_counts_norm_like.describe()) # we expect min = max = mean = 1e6
# Take the intersection of shared genes and cells (250, 3000)
common_cells = ss_mcf7_norm.columns.intersection(ss_mcf7_norm_like.columns)
common_genes = ss_mcf7_norm.index.intersection(ss_mcf7_norm_like.index)
diff = (ss_mcf7_norm.loc[common_genes, common_cells] -
ss_mcf7_norm_like.loc[common_genes, common_cells]).abs().mean().mean() # mean across all genes and all cells
print(f"Mean abs difference to scaled-to-1e6 normalization: {diff:.2f}")
Total counts after rescaling to 1e6: count 3.130000e+02 mean 1.000000e+06 std 5.434847e-11 min 1.000000e+06 25% 1.000000e+06 50% 1.000000e+06 75% 1.000000e+06 max 1.000000e+06 dtype: float64 Mean abs difference to scaled-to-1e6 normalization: 19.31
Absolute differences are small (~19 across ~1M-range values), but they don't tell us much because expression values span a large range.
# Flatten both matrices and extract the common genes/cells
flat_original = ss_mcf7_norm.loc[common_genes, common_cells].values.flatten()
flat_recreated = ss_mcf7_norm_like.loc[common_genes, common_cells].values.flatten()
# Compute Pearson correlation
cor = np.corrcoef(flat_original, flat_recreated)[0,1]
print(f"Pearson correlation of expression values: {cor:.4f}")
Pearson correlation of expression values: 0.9999
There is an almost perfect linear relationship between the expression values in the two matrices, which makes sense: we are just scaling ss_mcf7_filt, so a value that is high in one matrix will be high in the other as well. Nevertheless, the relatively small absolute difference suggests that we are on the right track.
Next we look at Scanpy's suggested approach: normalising to median total counts.
# Scanpy Normalisation
# Transpose the matrix to match AnnData convention: cells as rows
X = ss_mcf7_filt.T # shape: (cells × genes)
# Convert to AnnData
adata = ad.AnnData(X=X)
# Optional: name the genes and cells
adata.var_names = ss_mcf7_filt.index
adata.obs_names = ss_mcf7_filt.columns
# Normalise total counts per cell (default target_sum is median library size)
sc.pp.normalize_total(adata)
# Convert back to DataFrame
adata_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
# Transpose back so that both datasets are genes × cells
adata_df_T = adata_df.T
# Get common genes and cells
common_genes = ss_mcf7_norm.index.intersection(adata_df_T.index)
common_cells = ss_mcf7_norm.columns.intersection(adata_df_T.columns)
# Compute absolute difference
diff_matrix = (ss_mcf7_norm.loc[common_genes, common_cells] -
adata_df_T.loc[common_genes, common_cells]).abs()
# Mean absolute difference
mean_abs_diff = diff_matrix.mean().mean()
print(f"Mean absolute difference: {mean_abs_diff:.2f}")
# Pearson correlation between flattened matrices
flat_orig = ss_mcf7_norm.loc[common_genes, common_cells].values.flatten()
flat_scanpy = adata_df_T.loc[common_genes, common_cells].values.flatten()
cor = np.corrcoef(flat_orig, flat_scanpy)[0, 1]
print(f"Pearson correlation: {cor:.4f}")
Mean absolute difference: 0.77 Pearson correlation: 0.9999
Normalizing each cell to the dataset’s median library size produces an almost perfect match, confirming that Scanpy’s default normalize_total (with its median target) is the likely pipeline.
Is normalising to median counts a good approach? Yes!
Cells vary in sequencing depth: Some cells may have more total counts just due to being more deeply sequenced, not because they express more genes biologically. Total count normalization controls for this technical variability by rescaling each cell to have the same total count, making gene expression values comparable across cells. Using the median total count instead of a fixed value (like 1e5 or 1e6) ensures the scaling is dataset-specific and robust to outliers.
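A minimal reimplementation of this median-target scaling (equivalent to `sc.pp.normalize_total` with its default `target_sum`), sketched on a toy genes × cells DataFrame:

```python
import numpy as np
import pandas as pd

# Toy genes × cells count matrix; stand-in for ss_mcf7_filt
counts = pd.DataFrame(
    [[10, 0, 4], [0, 20, 6], [30, 20, 10]],
    index=["gene_a", "gene_b", "gene_c"],
    columns=["cell_1", "cell_2", "cell_3"],
)

lib_sizes = counts.sum(axis=0)                 # per-cell totals: 40, 40, 20
target = lib_sizes.median()                    # median library size: 40
norm = counts.div(lib_sizes, axis=1) * target  # rescale each cell to the median

# Every cell now sums to the median library size
print(norm.sum(axis=0).tolist())  # [40.0, 40.0, 40.0]
```

Because the target is derived from the data itself, cells near the median are barely changed, which keeps the rescaling robust to a few outlier libraries.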
Why 63 Cells Were Dropped?¶
# Quality Control for the dropped cells in the original space
dropped_cells = ss_mcf7_filt.columns.difference(ss_mcf7_norm.columns)
qc_metrics = pd.DataFrame({
"total_counts": ss_mcf7_filt.sum(axis=0),
"n_genes": (ss_mcf7_filt > 0).sum(axis=0)
})
qc_metrics.loc[dropped_cells].describe()
| total_counts | n_genes | |
|---|---|---|
| count | 6.300000e+01 | 63.000000 |
| mean | 1.104397e+06 | 9660.587302 |
| std | 5.121923e+05 | 1192.260534 |
| min | 2.633430e+05 | 6433.000000 |
| 25% | 7.101980e+05 | 8902.500000 |
| 50% | 1.229842e+06 | 9791.000000 |
| 75% | 1.515264e+06 | 10590.500000 |
| max | 1.982038e+06 | 11780.000000 |
# Quality Control for the retained cells in the original space
cells_retained = ss_mcf7_norm.columns
qc_metrics.loc[cells_retained].describe()
| total_counts | n_genes | |
|---|---|---|
| count | 2.500000e+02 | 250.000000 |
| mean | 1.171277e+06 | 10099.304000 |
| std | 3.613786e+05 | 1253.815198 |
| min | 2.847180e+05 | 5322.000000 |
| 25% | 9.383340e+05 | 9443.250000 |
| 50% | 1.198204e+06 | 10303.000000 |
| 75% | 1.448138e+06 | 10955.250000 |
| max | 1.970851e+06 | 12217.000000 |
QC metrics (total counts, genes detected) are nearly identical for dropped vs. retained cells, so their removal was probably tied to the gene-selection step rather than poor quality.
# Subset filtered matrix to just the 3000 genes in ss_mcf7_norm
genes_to_keep = ss_mcf7_norm.index
cells_to_check = ss_mcf7_filt.columns.difference(ss_mcf7_norm.columns)
subset = ss_mcf7_filt.loc[genes_to_keep, cells_to_check]
# Number of expressed genes (non-zero) per dropped cell
nonzeros_per_cell = (subset > 0).sum(axis=0)
nonzeros_per_cell.describe()
count 63.000000 mean 1037.968254 std 149.883715 min 662.000000 25% 927.500000 50% 1070.000000 75% 1155.500000 max 1322.000000 dtype: float64
# Retained cells
cells_retained = ss_mcf7_norm.columns
subset_retained = ss_mcf7_filt.loc[genes_to_keep, cells_retained]
nonzeros_retained = (subset_retained > 0).sum(axis=0)
nonzeros_retained.describe()
count 250.00000 mean 1082.24800 std 167.51017 min 574.00000 25% 982.25000 50% 1092.00000 75% 1197.75000 max 1455.00000 dtype: float64
Dropped cells still have thousands of detected genes among the top 3k set, so no obvious sparsity issue explains their removal.
# Function to extract just the condition (Hypo or Norm)
def extract_condition(colname):
return colname.split('_')[2] # "Hypo" or "Norm"
# Apply to all cells in filtered data
condition_all = ss_mcf7_filt.columns.to_series().apply(extract_condition)
# Apply to only the 250 retained cells
condition_retained = condition_all.loc[ss_mcf7_norm.columns]
# Count full and retained condition distributions
print("Condition distribution in full filtered set (313 cells):")
print(condition_all.value_counts())
print("\nCondition distribution in retained normalized set (250 cells):")
print(condition_retained.value_counts())
Condition distribution in full filtered set (313 cells): Norm 158 Hypo 155 Name: count, dtype: int64 Condition distribution in retained normalized set (250 cells): Norm 126 Hypo 124 Name: count, dtype: int64
The hypoxia vs. normoxia balance remains consistent, suggesting no condition-specific bias in cell dropout.
It is unclear to us why exactly those 63 cells were dropped.
Why 15945 Genes Were Dropped?¶
First we examine whether the genes could have been dropped due to low variance or low dispersion:
genes_norm = ss_mcf7_norm.index
ss_mcf7_norm_like = ss_mcf7_filt.div(ss_mcf7_filt.sum(axis=0), axis=1) * 1e5
gene_stats = pd.DataFrame({
'mean': ss_mcf7_norm_like.mean(axis=1),
'variance': ss_mcf7_norm_like.var(axis=1)
})
gene_stats_sorted = gene_stats.sort_values(by='variance', ascending=False)
top_3000_var = gene_stats_sorted.head(3000)
genes_predicted = top_3000_var.index
# How many genes overlap?
overlap = genes_predicted.intersection(genes_norm)
print(f"Overlap with ss_mcf7_norm genes (variance): {len(overlap)} / 3000")
gene_stats['dispersion'] = gene_stats['variance'] / gene_stats['mean']
gene_stats_filtered = gene_stats[gene_stats['mean'] > 0] # avoid div by zero
top_3000_disp = gene_stats_filtered.sort_values(by='dispersion', ascending=False).head(3000)
overlap_disp = top_3000_disp.index.intersection(genes_norm)
print(f"Overlap with ss_mcf7_norm genes (dispersion): {len(overlap_disp)} / 3000")
Overlap with ss_mcf7_norm genes (variance): 930 / 3000 Overlap with ss_mcf7_norm genes (dispersion): 1572 / 3000
High variance is not a good predictor for which genes were retained. Dispersion is better, but still not good enough, so we try Scanpy's HVG selection.
# Use our normalisation
ss_mcf7_norm_full = adata_df.T
# Convert to AnnData
adata = anndata.AnnData(X=np.log1p(ss_mcf7_norm_full.T.astype(float))) # See Note below
adata.var_names = ss_mcf7_filt.index
adata.obs_names = ss_mcf7_filt.columns
# Run HVG selection on the full gene set
sc.pp.highly_variable_genes(
adata,
flavor='seurat',
n_top_genes=3000,
inplace=True
)
# Filter to top 3,000 genes
adata = adata[:, adata.var['highly_variable']]
# Get the 3000 HVG gene names just selected by Scanpy
hvgs_from_filt = adata.var_names
# Get the original 3000 gene names from the ss_mcf7_norm matrix
hvgs_from_norm = ss_mcf7_norm.index
# Compute overlap
overlap = hvgs_from_filt.intersection(hvgs_from_norm)
print(f"Overlap: {len(overlap)} / 3000")
print(f"Percentage overlap: {len(overlap) / 3000 * 100:.1f}%")
Overlap: 2346 / 3000 Percentage overlap: 78.2%
Using Scanpy’s Seurat-flavored highly_variable_genes reproduces ~78% of the official gene set, confirming that HVG selection drove most of the gene exclusions. A 78% overlap is substantial, given that these 3,000 genes make up only ~16% of the genes in ss_mcf7_filt.
Note: The log1p transformation is applied to stabilize variance and reduce the influence of highly expressed genes, making the selection of highly variable genes (HVGs) more biologically meaningful and statistically robust. This follows best practices from the Scanpy tutorials.
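A toy illustration of why log1p matters here, with two hypothetical genes: on the raw scale, a highly expressed gene with proportional noise dominates the variance ranking, while after log1p the gene with large fold-changes carries more variance.

```python
import numpy as np

# Two toy genes: one highly expressed with scale-proportional noise,
# one low-expressed but with large fold-changes between cells.
high = np.array([9_000.0, 10_000.0, 11_000.0])
low = np.array([1.0, 10.0, 40.0])

print("raw variances:   ", high.var(), low.var())
print("log1p variances: ", np.log1p(high).var(), np.log1p(low).var())

# Raw scale: the highly expressed gene dominates.
# log1p scale: the fold-change-rich gene wins, which is what HVG
# selection on log data rewards.
```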
SmartSeq HCC1806 Raw vs Filtered¶
Here we repeat the gene- and cell-filtering comparison for the HCC1806 line, using the same “expressed in >5 cells” rule and QC thresholds as before. We’ll quantify how many genes and cells our simple filters remove versus the provided ss_hcc_filt dataset, and inspect any discrepancies for signs of manual curation or additional criteria.
Gene Filtering¶
ss_hcc_raw_filt.shape
(23339, 233)
ss_hcc_filt.shape
(19503, 227)
genes_raw = set(ss_hcc_raw.index)
genes_filtered = set(ss_hcc_filt.index)
dropped_genes = genes_raw - genes_filtered
print(f"Genes dropped: {len(dropped_genes)}")
Genes dropped: 3893
We see how many total genes the official filter removed compared to the raw set. We apply the same expression threshold as for MCF7:
# Genes that are expressed in more than 5 cells
ss_hcc_raw_genes_mask = (ss_hcc_raw > 0).sum(axis=1) > 5 # a higher threshold leaves fewer than 19503 genes remaining
ss_hcc_raw_gene_set = set(ss_hcc_raw.index[ss_hcc_raw_genes_mask])
print(f"Genes passing our threshold: {len(ss_hcc_raw_gene_set)}")
Genes passing our threshold: 19681
Our threshold drops far fewer genes than the official filter, indicating extra curation steps in the pipeline.
ss_hcc_raw_gene_set = ss_hcc_raw.index[ss_hcc_raw_genes_mask]
ss_hcc_filt_gene_set = ss_hcc_filt.index
overlap = ss_hcc_raw_gene_set.intersection(ss_hcc_filt_gene_set)
print(f"Overlap: {len(overlap)} / {len(ss_hcc_filt_gene_set)}")
Overlap: 19503 / 19503
# Convert Indexes to sets
ss_hcc_raw_gene_set = set(ss_hcc_raw_gene_set)
ss_hcc_filt_gene_set = set(ss_hcc_filt_gene_set)
# Find extra genes (present in the threshold but not in the filtered set)
extra_genes = ss_hcc_raw_gene_set - ss_hcc_filt_gene_set
extra_genes_list = list(extra_genes)
ss_hcc_raw.loc[extra_genes_list].mean(axis=1).describe()
count 178.000000 mean 0.321589 std 0.438802 min 0.024691 25% 0.058642 50% 0.164609 75% 0.421811 max 3.720165 dtype: float64
As in the MCF7 case, 178 genes that pass our expression threshold were nonetheless filtered out of ss_hcc_raw; they were most likely removed by manual curation.
Cell Filtering¶
dropped_cells = ss_hcc_raw.shape[1] - ss_hcc_filt.shape[1]
print(f"Number of removed cells: {dropped_cells}")
Number of removed cells: 16
qc_cells = pd.DataFrame({
"total_counts": ss_hcc_raw.sum(axis=0),
"n_genes": (ss_hcc_raw > 0).sum(axis=0)
})
retained_cells = ss_hcc_filt.columns
dropped_cells = ss_hcc_raw.columns.difference(retained_cells)
qc_retained = qc_cells.loc[retained_cells]
qc_dropped = qc_cells.loc[dropped_cells]
print("Retained cells:")
print(qc_retained.describe())
print("\nDropped cells:")
print(qc_dropped.describe())
Retained cells:
total_counts n_genes
count 2.270000e+02 227.000000
mean 2.095821e+06 10735.555066
std 1.084443e+06 1025.490256
min 3.421620e+05 7361.000000
25% 1.028269e+06 10260.500000
50% 2.157315e+06 10881.000000
75% 2.965222e+06 11431.500000
max 4.858841e+06 12698.000000
Dropped cells:
total_counts n_genes
count 1.600000e+01 16.000000
mean 8.274348e+05 4581.625000
std 1.683270e+06 5370.398208
min 1.140000e+02 35.000000
25% 4.207500e+02 84.000000
50% 3.734050e+04 884.000000
75% 2.899445e+05 9234.500000
max 5.758132e+06 13986.000000
Similar to the MCF7 case, total counts and detected genes have lower means among the dropped cells compared to the retained cells. For this reason, we filter out the cells with low total counts and low number of genes.
# Apply a candidate filter
cell_mask = (qc_cells['total_counts'] > 250_000) & (qc_cells['n_genes'] > 4000)
filtered_candidate = set(qc_cells.index[cell_mask])
# Cells in the original filtered dataset
original_filtered = set(ss_hcc_filt.columns)
# Overlap
overlap = filtered_candidate.intersection(original_filtered)
print(f"Candidate filter retains: {len(filtered_candidate)} cells")
print(f"Overlap with ss_hcc_filt: {len(overlap)} / {len(original_filtered)} ({len(overlap)/len(original_filtered)*100:.1f}%)")
Candidate filter retains: 230 cells Overlap with ss_hcc_filt: 227 / 227 (100.0%)
Experimenting with multiple total-count and gene-count thresholds, we find that >250,000 counts and >4,000 genes give the best threshold-based filtering rules. Our candidate filter retains all 227 cells of ss_hcc_filt plus 3 extra cells; those 3 were likely removed manually from the official set.
SmartSeq HCC1806 Filtered vs Normalised + Filtered¶
In this section, we examine how normalization and the final 3,000-gene selection reshape the filtered HCC1806 dataset. We explore:
- exploration: how many genes and cells are lost during normalization, and how do per-cell totals and gene counts change?
- normalization method: reconstruct Scanpy’s median-library-size normalization to confirm it matches the provided normalized matrix
- cell retention: compare QC metrics of dropped vs. retained cells to understand the basis of cell removal
- gene retention: use Scanpy’s highly variable gene (HVG) selection to see if it explains which 3,000 genes remain
Exploration¶
# Genes in filtered but not in normalised
dropped_genes = ss_hcc_filt.index.difference(ss_hcc_norm.index)
print(f"Genes dropped during normalization: {len(dropped_genes)}")
Genes dropped during normalization: 16503
dropped_cells = ss_hcc_filt.columns.difference(ss_hcc_norm.columns)
print(f"Cells dropped during normalization: {len(dropped_cells)}")
Cells dropped during normalization: 45
A large number of genes (16,503) and a moderate number of cells (45) are removed in the transition to normalized data.
# Total counts per cell = sum of all gene expression values
total_counts_before = ss_hcc_filt.sum(axis=0)
# Number of expressed genes (non-zero) per cell
n_genes_before = (ss_hcc_filt > 0).sum(axis=0)
total_counts_after = ss_hcc_norm.sum(axis=0)
n_genes_after = (ss_hcc_norm > 0).sum(axis=0)
print(f"Average total counts (before): {total_counts_before.mean():.2f}")
print(f"Average total counts (after): {total_counts_after.mean():.2f}\n")
print(f"Average n_genes per cell (before): {n_genes_before.mean():.2f}")
print(f"Average n_genes per cell (after): {n_genes_after.mean():.2f}")
Average total counts (before): 2095393.92
Average total counts (after): 502580.62

Average n_genes per cell (before): 10686.55
Average n_genes per cell (after): 880.40
Both per-cell totals and detected genes drop sharply. This mostly reflects the restriction to 3,000 genes: column sums are now taken over far fewer genes, and rescaling each cell's library size changes the totals further.
# Combine total counts
ss_hcc_total_counts = pd.DataFrame({
'total_counts': pd.concat([total_counts_before, total_counts_after]),
'stage': ['Before'] * len(total_counts_before) + ['After'] * len(total_counts_after)
})
# Combine gene counts
ss_hcc_n_genes = pd.DataFrame({
'n_genes': pd.concat([n_genes_before, n_genes_after]),
'stage': ['Before'] * len(n_genes_before) + ['After'] * len(n_genes_after)
})
plt.figure(figsize=(12, 5))
# Violin plot for total counts
plt.subplot(1, 2, 1)
sns.violinplot(data=ss_hcc_total_counts, x='stage', y='total_counts', hue='stage', palette='Set2', legend=False)
plt.title("Total Counts per Cell")
plt.xlabel("")
# Violin plot for # of genes
plt.subplot(1, 2, 2)
sns.violinplot(data=ss_hcc_n_genes, x='stage', y='n_genes', hue='stage', palette='Set2', legend=False)
plt.title("Number of Genes per Cell")
plt.xlabel("")
plt.suptitle("Before vs After Normalization", fontsize=14)
plt.tight_layout()
plt.show()
The “After” violins are tighter, reflecting uniform library sizes across cells.
print("Filtered + log2 mean variance:", np.log2(ss_hcc_filt + 1).var(axis=1).mean())
print("Normalised + log2 mean variance:", np.log2(ss_hcc_norm + 1).var(axis=1).mean())
Filtered + log2 mean variance: 3.11473854285522
Normalised + log2 mean variance: 3.076855767159178
Unlike MCF7, HCC1806 shows almost no change in mean gene variance after normalization.
Normalisation¶
We reconstruct Scanpy’s median-library-size normalization from the filtered matrix and compare the result to the provided normalized data.
# Scanpy Normalisation
# Transpose the matrix to match AnnData convention: cells as rows
X = ss_hcc_filt.T # shape: (cells × genes)
# Convert to AnnData
adata = ad.AnnData(X=X)
# Optional: name the genes and cells
adata.var_names = ss_hcc_filt.index
adata.obs_names = ss_hcc_filt.columns
# Normalise total counts per cell (default target_sum is median library size)
sc.pp.normalize_total(adata)
# The normalized matrix (no log transform has been applied) is now in:
adata.X # (sparse or dense depending on input)
# Convert back to DataFrame
adata_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
# Transpose back, so that both datasets are genes × cells
adata_df_T = adata_df.T
# Get common genes and cells
common_genes = ss_hcc_norm.index.intersection(adata_df_T.index)
common_cells = ss_hcc_norm.columns.intersection(adata_df_T.columns)
# Compute absolute difference
diff_matrix = (ss_hcc_norm.loc[common_genes, common_cells] -
adata_df_T.loc[common_genes, common_cells]).abs()
# 1. Mean absolute difference
mean_abs_diff = diff_matrix.mean().mean()
print(f"Mean absolute difference: {mean_abs_diff:.2f}")
# 2. Pearson correlation between flattened matrices
flat_orig = ss_hcc_norm.loc[common_genes, common_cells].values.flatten()
flat_scanpy = adata_df_T.loc[common_genes, common_cells].values.flatten()
cor = np.corrcoef(flat_orig, flat_scanpy)[0, 1]
print(f"Pearson correlation: {cor:.4f}")
Mean absolute difference: 1.38
Pearson correlation: 0.9998
The mean absolute difference (1.38) is tiny relative to per-cell totals of roughly 5 × 10^5, and r ≈ 1 confirms that the provided matrix was normalized to the median library size.
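Independently of Scanpy, the principle behind median library-size normalization can be verified in a few lines of NumPy: divide each cell's counts by its total and multiply by the median total, so that every cell sums to the same value (a toy matrix; values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(5, size=(100, 8)).astype(float)  # toy genes x cells matrix

totals = counts.sum(axis=0)            # library size per cell
target = np.median(totals)             # Scanpy's default target_sum
normalized = counts / totals * target  # rescale each cell to the median total

# After normalization, every cell sums to the median library size
print(np.allclose(normalized.sum(axis=0), target))
```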
Dropped vs Retained Cells¶
# Quality Control for the dropped cells in the original space
dropped_cells = ss_hcc_filt.columns.difference(ss_hcc_norm.columns)
qc_metrics = pd.DataFrame({
"total_counts": ss_hcc_filt.sum(axis=0),
"n_genes": (ss_hcc_filt > 0).sum(axis=0)
})
qc_metrics.loc[dropped_cells].describe()
| total_counts | n_genes | |
|---|---|---|
| count | 4.500000e+01 | 45.000000 |
| mean | 2.495615e+06 | 10189.177778 |
| std | 1.176204e+06 | 1117.962641 |
| min | 6.968360e+05 | 7558.000000 |
| 25% | 1.100835e+06 | 9416.000000 |
| 50% | 2.853875e+06 | 10354.000000 |
| 75% | 3.351452e+06 | 10996.000000 |
| max | 4.858344e+06 | 11830.000000 |
# Quality Control for the retained cells in the original space
cells_retained = ss_hcc_norm.columns
qc_metrics.loc[cells_retained].describe()
| total_counts | n_genes | |
|---|---|---|
| count | 1.820000e+02 | 182.000000 |
| mean | 1.996438e+06 | 10809.527473 |
| std | 1.040057e+06 | 957.956139 |
| min | 3.421010e+05 | 7268.000000 |
| 25% | 1.001764e+06 | 10334.000000 |
| 50% | 1.974784e+06 | 10907.500000 |
| 75% | 2.791580e+06 | 11498.250000 |
| max | 4.774799e+06 | 12629.000000 |
Cell Filtering Outcome Summary¶
- A total of 227 cells were initially present.
- After normalisation + filtering, 182 cells were retained, and 45 were dropped.
Dropped Cells:¶
- Have higher average total counts (2.5M)
- But lower number of expressed genes (mean ≈ 10,189)
Retained Cells:¶
- Have slightly lower total counts (2.0M)
- But more genes detected per cell (mean ≈ 10,810)
However, the differences are not large enough to make any claims.
Now we restrict the filtered matrix to the 3,000 genes in ss_hcc_norm in order to investigate whether the dropped vs. retained cells differ substantially in the number of expressed genes per cell.
# Subset filtered matrix to just the 3000 genes in ss_hcc_norm
genes_to_keep = ss_hcc_norm.index
cells_to_check = ss_hcc_filt.columns.difference(ss_hcc_norm.columns)
subset = ss_hcc_filt.loc[genes_to_keep, cells_to_check]
# Number of expressed genes (non-zero) per dropped cell
nonzeros_per_cell = (subset > 0).sum(axis=0)
nonzeros_per_cell.describe()
count      45.000000
mean      792.533333
std       116.710715
min       549.000000
25%       720.000000
50%       778.000000
75%       847.000000
max      1104.000000
dtype: float64
# Retained cells
subset_retained = ss_hcc_filt.loc[genes_to_keep, cells_retained]
# Number of expressed genes (non-zero) per retained cell
nonzeros_retained = (subset_retained > 0).sum(axis=0)
nonzeros_retained.describe()
count     182.000000
mean      870.175824
std       115.900805
min       549.000000
25%       793.500000
50%       872.500000
75%       944.500000
max      1169.000000
dtype: float64
Dropped cells express slightly fewer of the 3,000 genes on average (≈793 vs. ≈870), but the distributions overlap heavily, so this difference alone does not explain the removal.
It is not clear to us why exactly those 45 cells were dropped.
Dropped vs Retained Genes¶
HVG selection identifies the genes that show the most variability across cells, relative to their average expression level.
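As a rough sketch of the idea (not Scanpy's exact 'seurat' implementation), HVG selection can be emulated with NumPy alone: compute each gene's dispersion (variance/mean), z-score dispersions within bins of similar mean expression, and rank genes by the normalized dispersion. The toy matrix below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.gamma(2.0, 2.0, size=(200, 1000))  # toy cells x genes expression matrix

means = X.mean(axis=0)
dispersions = X.var(axis=0) / means        # dispersion = variance / mean

# Bin genes into 20 groups of similar mean expression
edges = np.quantile(means, np.linspace(0, 1, 21)[1:-1])
bins = np.digitize(means, edges)

# Z-score dispersions within each bin
norm_disp = np.empty_like(dispersions)
for b in np.unique(bins):
    in_bin = bins == b
    mu, sd = dispersions[in_bin].mean(), dispersions[in_bin].std()
    norm_disp[in_bin] = (dispersions[in_bin] - mu) / (sd if sd > 0 else 1.0)

# Keep the top 100 genes by normalized dispersion
top = np.argsort(norm_disp)[::-1][:100]
print(f"Selected {len(top)} highly variable genes")
```

Binning by mean expression is what makes the selection "relative to average expression level": a gene only counts as highly variable if it is more dispersed than other genes of comparable abundance.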
# Step 1: Use our normalisation
ss_hcc_norm_full = adata_df.T
# Step 2: Convert to AnnData
adata = anndata.AnnData(X=np.log1p(ss_hcc_norm_full.T.astype(float)))
adata.var_names = ss_hcc_filt.index
adata.obs_names = ss_hcc_filt.columns
# Step 3: Run HVG selection on the full gene set
sc.pp.highly_variable_genes(
adata,
flavor='seurat',
n_top_genes=3000,
inplace=True
)
# Step 4: Filter to top 3,000 genes
adata = adata[:, adata.var['highly_variable']]
# Get the 3000 HVG gene names just selected by Scanpy
hvgs_from_filt = adata.var_names
# Get the original 3000 gene names from the ss_hcc_norm matrix
hvgs_from_norm = ss_hcc_norm.index
# Compute overlap
overlap = hvgs_from_filt.intersection(hvgs_from_norm)
print(f"Overlap: {len(overlap)} / 3000")
print(f"Percentage Overlap: {len(overlap) / 3000 * 100:.2f}%")
Overlap: 2080 / 3000
Percentage Overlap: 69.33%
The overlap for HCC1806 (69%) is lower than for MCF7 (78%), but it is still substantial, so HVG selection was very likely part of the feature-reduction step.
DropSeq¶
In this section, we load the pre-filtered, normalized expression matrices (top 3,000 genes) generated by Drop-seq for both MCF7 and HCC1806 lines. We then:
- peek at the first few rows of each matrix to confirm that gene IDs and normalized counts look as expected
- check the overall dimensions to ensure we have the correct number of cells and features
This quick sanity check ensures that our Drop-seq data are correctly loaded.
ds_mcf7_norm = pd.read_csv(
    "AILab2025/DropSeq/MCF7_Filtered_Normalised_3000_Data_train.txt",
    delimiter=" ", engine="python", index_col=0
)
ds_hcc_norm = pd.read_csv(
    "AILab2025/DropSeq/HCC1806_Filtered_Normalised_3000_Data_train.txt",
    delimiter=" ", engine="python", index_col=0
)
ds_mcf7_norm.shape
(3000, 21626)
ds_mcf7_norm.head(5)
| AAAAACCTATCG_Normoxia | AAAACAACCCTA_Normoxia | AAAACACTCTCA_Normoxia | AAAACCAGGCAC_Normoxia | AAAACCTAGCTC_Normoxia | AAAACCTCCGGG_Normoxia | AAAACTCGTTGC_Normoxia | AAAAGAGCTCTC_Normoxia | AAAAGCTAGGCG_Normoxia | AAAATCGCATTT_Normoxia | ... | TTTTACAGGATC_Hypoxia | TTTTACCACGTA_Hypoxia | TTTTATGCTACG_Hypoxia | TTTTCCAGACGC_Hypoxia | TTTTCGCGCTCG_Hypoxia | TTTTCGCGTAGA_Hypoxia | TTTTCGTCCGCT_Hypoxia | TTTTCTCCGGCT_Hypoxia | TTTTGTTCAAAG_Hypoxia | TTTTTTGTATGT_Hypoxia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MALAT1 | 1 | 3 | 3 | 6 | 4 | 5 | 1 | 13 | 3 | 3 | ... | 0 | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 4 |
| MT-RNR2 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 1 | 7 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| NEAT1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| H1-5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| TFF1 | 4 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | ... | 2 | 3 | 8 | 0 | 0 | 3 | 4 | 2 | 6 | 0 |
5 rows × 21626 columns
ds_mcf7_norm.describe()
| AAAAACCTATCG_Normoxia | AAAACAACCCTA_Normoxia | AAAACACTCTCA_Normoxia | AAAACCAGGCAC_Normoxia | AAAACCTAGCTC_Normoxia | AAAACCTCCGGG_Normoxia | AAAACTCGTTGC_Normoxia | AAAAGAGCTCTC_Normoxia | AAAAGCTAGGCG_Normoxia | AAAATCGCATTT_Normoxia | ... | TTTTACAGGATC_Hypoxia | TTTTACCACGTA_Hypoxia | TTTTATGCTACG_Hypoxia | TTTTCCAGACGC_Hypoxia | TTTTCGCGCTCG_Hypoxia | TTTTCGCGTAGA_Hypoxia | TTTTCGTCCGCT_Hypoxia | TTTTCTCCGGCT_Hypoxia | TTTTGTTCAAAG_Hypoxia | TTTTTTGTATGT_Hypoxia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | ... | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 |
| mean | 0.034000 | 0.030333 | 0.027000 | 0.032333 | 0.045333 | 0.047333 | 0.030000 | 0.027333 | 0.032000 | 0.027333 | ... | 0.052333 | 0.043667 | 0.033667 | 0.033000 | 0.025333 | 0.037000 | 0.046333 | 0.055667 | 0.038000 | 0.033000 |
| std | 0.277254 | 0.220823 | 0.195662 | 0.233751 | 0.246235 | 0.299649 | 0.204403 | 0.292030 | 0.281074 | 0.237918 | ... | 0.364654 | 0.244499 | 0.340449 | 0.302117 | 0.208261 | 0.286924 | 0.301469 | 0.358623 | 0.240642 | 0.244808 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 4.000000 | 4.000000 | 5.000000 | 6.000000 | 4.000000 | 8.000000 | 6.000000 | 13.000000 | 7.000000 | 6.000000 | ... | 7.000000 | 4.000000 | 10.000000 | 8.000000 | 6.000000 | 7.000000 | 7.000000 | 9.000000 | 6.000000 | 6.000000 |
8 rows × 21626 columns
ds_hcc_norm.shape
(3000, 14682)
ds_hcc_norm.head(5)
| AAAAAACCCGGC_Normoxia | AAAACCGGATGC_Normoxia | AAAACGAGCTAG_Normoxia | AAAACTTCCCCG_Normoxia | AAAAGCCTACCC_Normoxia | AAACACAAATCT_Normoxia | AAACCAAGCCCA_Normoxia | AAACCATGCACT_Normoxia | AAACCTCCGGCT_Normoxia | AAACGCCGGTCC_Normoxia | ... | TTTTCTGATGGT_Hypoxia | TTTTGATTCAGA_Hypoxia | TTTTGCAACTGA_Hypoxia | TTTTGCCGGGCC_Hypoxia | TTTTGTTAGCCT_Hypoxia | TTTTTACCAATC_Hypoxia | TTTTTCCGTGCA_Hypoxia | TTTTTGCCTGGG_Hypoxia | TTTTTGTAACAG_Hypoxia | TTTTTTTGAATC_Hypoxia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H1-5 | 2 | 2 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 3 | 1 |
| MALAT1 | 3 | 3 | 2 | 3 | 12 | 3 | 1 | 2 | 0 | 0 | ... | 3 | 1 | 1 | 1 | 4 | 0 | 4 | 1 | 3 | 6 |
| MT-RNR2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 2 | 2 | 2 | 0 | 0 | 1 | 0 | 1 | 0 |
| ARVCF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BCYRN1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 3 | ... | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
5 rows × 14682 columns
ds_hcc_norm.describe()
| AAAAAACCCGGC_Normoxia | AAAACCGGATGC_Normoxia | AAAACGAGCTAG_Normoxia | AAAACTTCCCCG_Normoxia | AAAAGCCTACCC_Normoxia | AAACACAAATCT_Normoxia | AAACCAAGCCCA_Normoxia | AAACCATGCACT_Normoxia | AAACCTCCGGCT_Normoxia | AAACGCCGGTCC_Normoxia | ... | TTTTCTGATGGT_Hypoxia | TTTTGATTCAGA_Hypoxia | TTTTGCAACTGA_Hypoxia | TTTTGCCGGGCC_Hypoxia | TTTTGTTAGCCT_Hypoxia | TTTTTACCAATC_Hypoxia | TTTTTCCGTGCA_Hypoxia | TTTTTGCCTGGG_Hypoxia | TTTTTGTAACAG_Hypoxia | TTTTTTTGAATC_Hypoxia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3000.00000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.00000 | 3000.000000 | 3000.000000 | ... | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 |
| mean | 0.02900 | 0.041667 | 0.024333 | 0.021667 | 0.029667 | 0.020000 | 0.036000 | 0.02600 | 0.034000 | 0.029333 | ... | 0.043000 | 0.049667 | 0.037000 | 0.047667 | 0.057000 | 0.023333 | 0.041667 | 0.041667 | 0.043333 | 0.040000 |
| std | 0.23276 | 0.309778 | 0.231860 | 0.189409 | 0.323761 | 0.170126 | 0.250449 | 0.23525 | 0.231362 | 0.218683 | ... | 0.271739 | 0.319219 | 0.279864 | 0.259648 | 0.304053 | 0.214797 | 0.236536 | 0.285116 | 0.267356 | 0.282418 |
| min | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 5.00000 | 9.000000 | 7.000000 | 4.000000 | 12.000000 | 3.000000 | 4.000000 | 6.00000 | 4.000000 | 4.000000 | ... | 4.000000 | 7.000000 | 7.000000 | 4.000000 | 5.000000 | 4.000000 | 4.000000 | 5.000000 | 5.000000 | 6.000000 |
8 rows × 14682 columns
We immediately notice that the Drop-seq data take much smaller values and that the vast majority of entries are zeros.
Sequencing Technology Comparison¶
Smart-seq captures full-length transcripts with high sensitivity, enabling detection of lowly expressed genes and isoform analysis, but is limited to fewer cells (250 and 182) due to higher cost and lower throughput. Drop-seq profiles thousands of cells (21626 and 14682) by capturing only the 3′ ends of transcripts and using UMIs for quantification. While Drop-seq is more scalable, it produces sparser data with lower gene detection per cell, making it more suitable for large-scale cell population studies than detailed transcriptomic analysis.
Let us compare the proportion of zero entries in our data for Smart-seq vs Drop-seq:
smart_zero_prop = (ss_mcf7_norm == 0).sum().sum() / ss_mcf7_norm.size
drop_zero_prop = (ds_mcf7_norm == 0).sum().sum() / ds_mcf7_norm.size
print(f"Sparsity (Smart-seq MCF7): {smart_zero_prop:.2%}")
print(f"Sparsity (Drop-seq MCF7): {drop_zero_prop:.2%}")
Sparsity (Smart-seq MCF7): 63.62%
Sparsity (Drop-seq MCF7): 97.53%
smart_zero_prop = (ss_hcc_norm == 0).sum().sum() / ss_hcc_norm.size
drop_zero_prop = (ds_hcc_norm == 0).sum().sum() / ds_hcc_norm.size
print(f"Sparsity (Smart-seq HCC1806): {smart_zero_prop:.2%}")
print(f"Sparsity (Drop-seq HCC1806): {drop_zero_prop:.2%}")
Sparsity (Smart-seq HCC1806): 70.65%
Sparsity (Drop-seq HCC1806): 97.64%
Drop-seq data are indeed far sparser.
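At ~97–98% zeros, these matrices would benefit from compressed sparse storage. A quick comparison on a synthetic matrix of similar sparsity (the size and values here are made up) illustrates the memory saving:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(4)
# Dense toy matrix with ~97% zeros, mimicking Drop-seq sparsity
dense = rng.poisson(0.03, size=(3000, 2000)).astype(np.float64)
sparsity = (dense == 0).mean()

# CSR stores only the non-zero values plus their column indices and row pointers
csr = sparse.csr_matrix(dense)
dense_mb = dense.nbytes / 1e6
sparse_mb = (csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes) / 1e6
print(f"Sparsity: {sparsity:.2%}  dense: {dense_mb:.1f} MB  sparse: {sparse_mb:.1f} MB")
```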
Label Distribution¶
Now let us check whether the classes (hypoxia, normoxia) are balanced in these datasets:
hypo_count = sum('hypo' in col.lower() for col in ds_mcf7_norm.columns)
norm_count = sum('norm' in col.lower() for col in ds_mcf7_norm.columns)
print(f"Hypoxic samples: {hypo_count}")
print(f"Normoxic samples: {norm_count}")
total = hypo_count + norm_count
print(f"Hypoxic: {hypo_count/total:.2%}, Normoxic: {norm_count/total:.2%}")
Hypoxic samples: 8921
Normoxic samples: 12705
Hypoxic: 41.25%, Normoxic: 58.75%
hypo_count = sum('hypo' in col.lower() for col in ds_hcc_norm.columns)
norm_count = sum('norm' in col.lower() for col in ds_hcc_norm.columns)
print(f"Hypoxic samples: {hypo_count}")
print(f"Normoxic samples: {norm_count}")
total = hypo_count + norm_count
print(f"Hypoxic: {hypo_count/total:.2%}, Normoxic: {norm_count/total:.2%}")
Hypoxic samples: 8899
Normoxic samples: 5783
Hypoxic: 60.61%, Normoxic: 39.39%
Both Drop-seq datasets show moderate class imbalance. In MCF7, normoxic cells are more abundant, while in HCC1806, hypoxic cells dominate. This imbalance may influence classifier learning dynamics, potentially biasing models toward the majority class if not handled carefully.
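A common mitigation is to weight classes inversely to their frequency during training. The "balanced" heuristic used by scikit-learn (w_c = n_samples / (n_classes · n_c)) can be computed directly from the HCC1806 label counts reported above:

```python
# Label counts from the Drop-seq HCC1806 dataset above
counts = {"Hypoxia": 8899, "Normoxia": 5783}
n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: w_c = n_samples / (n_classes * n_c)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
for c, w in weights.items():
    print(f"{c}: weight = {w:.3f}")
```

The minority class (here Normoxia) receives a weight above 1, so each of its samples contributes more to the loss; passing `class_weight='balanced'` to scikit-learn classifiers applies the same scheme automatically.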
Correlation between gene expression profiles¶
Mean pairwise cell–cell correlation is the average Pearson correlation between the expression profiles of all pairs of cells.
def mean_pairwise_correlation(df, method='pearson'):
"""
Computes the mean pairwise correlation between columns.
Parameters:
df (pd.DataFrame): The expression matrix.
method (str): 'pearson', 'spearman', or 'kendall'.
Returns:
float: Mean pairwise correlation (excluding self-correlations).
"""
cor_matrix = df.corr(method=method)
upper_tri_values = cor_matrix.values[np.triu_indices_from(cor_matrix, k=1)]
return upper_tri_values.mean()
print("Mean pairwise cell–cell correlation")
print(f"Raw MCF7: {mean_pairwise_correlation(ss_mcf7_raw):.4f}")
print(f"Filtered MCF7: {mean_pairwise_correlation(ss_mcf7_filt):.4f}")
print(f"Normalised MCF7: {mean_pairwise_correlation(ss_mcf7_norm):.4f}")
print(f"Raw HCC1806: {mean_pairwise_correlation(ss_hcc_raw):.4f}")
print(f"Filtered HCC1806: {mean_pairwise_correlation(ss_hcc_filt):.4f}")
print(f"Normalised HCC1806: {mean_pairwise_correlation(ss_hcc_norm):.4f}")
Mean pairwise cell–cell correlation
Raw MCF7: 0.6720
Filtered MCF7: 0.7211
Normalised MCF7: 0.6654
Raw HCC1806: 0.7405
Filtered HCC1806: 0.7985
Normalised HCC1806: 0.7480
In both MCF7 and HCC1806, going from raw to filtered increases correlation. This indicates:
- Low-quality or noisy cells were successfully removed
- The retained cells share more biologically consistent expression profiles
After normalization and HVG selection, the mean pairwise correlation between cells drops slightly in both cell lines. This reflects two effects: normalization removes global differences in sequencing depth, reducing technical variability; meanwhile, highly variable gene (HVG) filtering discards stable, housekeeping genes and retains genes that emphasize biological differences between cells. Together, these steps increase the biological resolution of the data but can reduce overall similarity across cells.
Nevertheless, all versions of the MCF7 and HCC1806 Smart-seq data (raw, filtered, and filtered + normalized) exhibit strong mean pairwise cell–cell correlations (~0.6–0.8). This confirms good internal consistency (cohesive cell populations, low noise) across samples and suggests that preprocessing steps effectively preserved the underlying biological structure of the data.
Unsupervised Learning¶
PCA¶
To distill the high-dimensional single-cell expression matrices into their most informative axes, we apply PCA across all four dataset–condition combinations. Our goals are:
- Variance capture: Determine the smallest set of PCs explaining ≥95% of the total variance, thereby retaining the bulk of biological signal while discarding noise and redundancy for downstream modeling.
- Preprocessing impact: Compare unscaled vs. unit-variance-scaled PCA to understand how per-gene normalization redistributes variance and affects our ability to separate hypoxia vs. normoxia states.
Unit-variance scaling equalizes gene variances - potentially up-weighting technical noise from lowly expressed genes - whereas unscaled PCA preserves the raw variance structure, emphasizing dominant biological patterns. By quantifying both the cumulative‐variance and the classification power of leading PCs in each regime, we assess the robustness of hypoxia signals and provide practical guidance for preprocessing choices in single-cell analyses.
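The variance-redistribution effect can be demonstrated on a toy matrix with a single dominant high-variance feature: with centering only, PC1 absorbs almost all the variance, whereas after unit-variance scaling the variance spreads across components (a minimal NumPy sketch, not our actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 20))
X[:, 0] *= 30  # one "gene" with much larger variance than the rest

def explained_variance_ratio(M):
    """PCA via SVD of the centered matrix; returns per-PC variance fractions."""
    Mc = M - M.mean(axis=0)
    s = np.linalg.svd(Mc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

unscaled = explained_variance_ratio(X)
scaled = explained_variance_ratio(X / X.std(axis=0))

print(f"PC1 share, unscaled: {unscaled[0]:.2%}")  # dominated by the loud feature
print(f"PC1 share, scaled:   {scaled[0]:.2%}")    # spread across many PCs
```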
Helper functions¶
This cell defines a set of helper functions for streamlined PCA-based analysis and diagnostic evaluation:
get_top_components
- Plots a scree plot of explained variance for an AnnData object on which PCA has already been run
- Computes the number of PCs required to explain 95% of variance
- Saves both the count and the reduced coordinates in adata

add_labels
- Parses cell names in adata.obs_names
- Adds a binary condition column (“Hypo” vs. “Norm”) for later evaluation

best_linear_pc_split
- Performs pairwise logistic regression on the first max_pc PCs
- Uses cross-validation to find which two PCs best separate “Hypo” vs. “Norm”

best_split
- Wraps best_linear_pc_split
- Prints the best PC pair and CV score
- Generates a scatterplot of cells on those two PCs, colored by condition
These functions allow us to run unsupervised PCA and perform a supervised check to evaluate how our top components capture the hypoxia vs. normoxia signal.
import warnings

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore')
def get_top_components(adata, n_pcs, plot=True):
"""Returns the number of components needed to explain 95% of the variance (assumes PCA has already been run on adata, e.g. via sc.pp.pca)
Parameters:
adata: AnnData object
n_pcs: number of principal components to compute
plot: whether to plot the variance explained per PC (scree plot)
Returns:
n_components_95: number of components needed to explain 95% of the variance
Additionally it adds the following to the adata object:
adata.uns['pca']['n_components_95']: number of components needed to explain 95% of the variance
adata.obsm['X_pca_95']: PCA coordinates for the first n_components_95
"""
# plot variance explained per PC (scree plot)
if plot:
sc.pl.pca_variance_ratio(adata, n_pcs=n_pcs, log=False)
# access the variance ratios
explained_var = adata.uns['pca']['variance_ratio'] # array of variance explained per PC
# compute cumulative sum
cumulative_var = np.cumsum(explained_var)
# find the number of PCs needed to reach 95% variance
n_components_95 = np.argmax(cumulative_var >= 0.95) + 1 # +1 because np.argmax is 0-based
print(f"Number of PCs needed to explain 95% variance: {n_components_95}")
# add to the adata object this information
adata.uns['pca']['n_components_95'] = n_components_95
adata.obsm['X_pca_95'] = adata.obsm['X_pca'][:, :n_components_95]
for i, var in enumerate(explained_var[:10]):
print(f"PC{i+1} explains: {var*100:.2f}%")
return n_components_95
def add_labels(adata):
"""this function will add labels to the adata object based on the cell names
"""
# add labels based on cell names
adata.obs['condition'] = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in adata.obs_names]
def best_linear_pc_split(adata, label_key='condition', max_pc=20):
"""
Finds the best pair of principal components (among first `max_pc`) for linearly separating the data.
Parameters:
- adata: AnnData object with PCA computed (`adata.obsm['X_pca']` must exist).
- label_key: column in `adata.obs` used as target for classification.
- max_pc: maximum number of PCs to consider (default=20).
Returns:
- best_pair: tuple (pc1, pc2) with 1-based PC indices.
- best_score: mean cross-validation accuracy.
"""
if 'X_pca' not in adata.obsm:
raise ValueError("Run sc.tl.pca(adata) first to compute PCA")
X_pca = adata.obsm['X_pca'][:, :max_pc]
y = adata.obs[label_key].values
best_pair = None
best_score = -np.inf
for i, j in combinations(range(max_pc), 2):
X_pair = X_pca[:, [i, j]]
clf = LogisticRegression(max_iter=1000)
score = cross_val_score(clf, X_pair, y, cv=5).mean()
if score > best_score:
best_score = score
best_pair = (i + 1, j + 1) # return 1-based PC indices
return best_pair, best_score
def best_split(adata, label_key='condition', max_pc=20):
"""draws the best split plot"""
best_pair, best_score = best_linear_pc_split(adata, label_key=label_key, max_pc=max_pc)
print(f"Best pair of PCs: {best_pair} with score: {best_score:.4f}")
plt.figure(figsize=(8, 6))
sns.scatterplot(x=adata.obsm['X_pca'][:, best_pair[0] - 1],
y=adata.obsm['X_pca'][:, best_pair[1] - 1],
hue=adata.obs[label_key],
palette='Set2',
s = 10)
plt.title(f"Best pair of PCs: {best_pair} with score: {best_score:.4f}")
plt.xlabel(f"PC{best_pair[0]}")
plt.ylabel(f"PC{best_pair[1]}")
plt.legend(title='condition')
plt.show()
SmartSeq¶
SmartSeq MCF7¶
We start with the non-scaled PCA:
X = ss_mcf7_norm.T # cells × genes
adata_ss_mcf7 = ad.AnnData(X)
adata_ss_mcf7.obs_names = ss_mcf7_norm.columns # adding cell names
adata_ss_mcf7.var_names = ss_mcf7_norm.index # adding gene names
add_labels(adata_ss_mcf7) # adding condition labels
sc.pp.pca(adata_ss_mcf7, n_comps=50, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ss_mcf7, n_pcs=50, plot=True)
# plot PCA
best_split(adata_ss_mcf7, label_key='condition', max_pc=10)
Number of PCs needed to explain 95% variance: 20
PC1 explains: 63.45%
PC2 explains: 9.11%
PC3 explains: 6.27%
PC4 explains: 4.03%
PC5 explains: 3.16%
PC6 explains: 1.54%
PC7 explains: 1.14%
PC8 explains: 1.01%
PC9 explains: 0.90%
PC10 explains: 0.76%
Best pair of PCs: (1, 6) with score: 0.9920
The first 20 principal components capture ≥95% of the total variance in the untransformed, unscaled data. The best 2-PC split is PC1 vs PC6 with 0.992 accuracy - these two axes yield nearly perfect linear separation of ‘Hypo’ vs ‘Norm’ cells. The unscaled data are strongly dominated by PC1 (over 60% variance), suggesting a single major gradient separates the samples. Yet PC6 adds the extra discriminative power needed for clean classification.
Now let's look at the scaled option:
X_log = np.log1p(ss_mcf7_norm.T) # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log) # scaling the data
adata_ss_mcf7_scaled = ad.AnnData(X_scaled)
adata_ss_mcf7_scaled.obs_names = ss_mcf7_norm.columns # adding cell names
adata_ss_mcf7_scaled.var_names = ss_mcf7_norm.index # adding gene names
add_labels(adata_ss_mcf7_scaled) # adding condition labels
sc.pp.pca(adata_ss_mcf7_scaled, n_comps=249, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ss_mcf7_scaled, n_pcs=249, plot=True)
# plot PCA
best_split(adata_ss_mcf7_scaled, label_key='condition', max_pc=50)
Number of PCs needed to explain 95% variance: 204
PC1 explains: 18.64%
PC2 explains: 5.55%
PC3 explains: 4.02%
PC4 explains: 2.24%
PC5 explains: 1.77%
PC6 explains: 1.47%
PC7 explains: 1.26%
PC8 explains: 1.16%
PC9 explains: 1.04%
PC10 explains: 0.97%
Best pair of PCs: (1, 2) with score: 0.9920
Enforcing unit variance per gene spreads the variance more evenly across many components, leaving us with 204 PCs for explaining >95% of variance. The best 2-PC split is achieved with PC1 and PC2 and yields 0.992 accuracy (same as unscaled one), giving us nearly perfect separation. In short, scaling dramatically reduces the dominance of PC1 (down from 63% to 19%) and distributes signal into higher PCs. As a result, the “best” discriminative axes shift - PC2 (rather than PC6) now carries enough of the hypoxia signal to pair with PC1.
SmartSeq HCC1806¶
Similarly, we begin with non-scaled:
data = ss_hcc_norm
X = data.T # cells × genes
adata_ss_hcc = ad.AnnData(X)
adata_ss_hcc.obs_names = data.columns # adding cell names
adata_ss_hcc.var_names = data.index # adding gene names
add_labels(adata_ss_hcc) # adding condition labels
sc.pp.pca(adata_ss_hcc, n_comps=50, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ss_hcc, n_pcs=50, plot=True)
# plot PCA
best_split(adata_ss_hcc, label_key='condition', max_pc=10)
Number of PCs needed to explain 95% variance: 34
PC1 explains: 29.02%
PC2 explains: 18.10%
PC3 explains: 12.29%
PC4 explains: 7.97%
PC5 explains: 4.96%
PC6 explains: 3.64%
PC7 explains: 2.74%
PC8 explains: 2.11%
PC9 explains: 1.74%
PC10 explains: 1.32%
Best pair of PCs: (2, 3) with score: 0.9450
The first 34 components capture ≥95% of the raw data’s variance. Best 2-PC split is achieved with PC2 and PC3 and 0.945 accuracy. PCs 2 & 3 together yield a clear, though slightly less perfect, separation of Hypo vs Norm cells compared to the MCF7 line. Here variance is less dominated by PC1 (29% vs. 63% before), and PCs 2 & 3 carry strong hypoxia signals. This suggests HCC1806 biology is more multifaceted: multiple axes beyond the first contribute meaningfully to condition differences.
Now the scaled option:
data = ss_hcc_norm
scaler = StandardScaler()
X_log = np.log1p(data.T) # cells × genes
X_scaled = scaler.fit_transform(X_log) # scaling the data
adata_ss_hcc_scaled = ad.AnnData(X_scaled)
adata_ss_hcc_scaled.obs_names = data.columns # adding cell names
adata_ss_hcc_scaled.var_names = data.index # adding gene names
add_labels(adata_ss_hcc_scaled) # adding condition labels
sc.pp.pca(adata_ss_hcc_scaled, n_comps=181, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ss_hcc_scaled, n_pcs=181)
# plot PCA
best_split(adata_ss_hcc_scaled, label_key='condition', max_pc=10)
Number of PCs needed to explain 95% variance: 161
PC1 explains: 4.50%
PC2 explains: 3.23%
PC3 explains: 2.56%
PC4 explains: 1.97%
PC5 explains: 1.55%
PC6 explains: 1.46%
PC7 explains: 1.31%
PC8 explains: 1.21%
PC9 explains: 1.07%
PC10 explains: 0.99%
Best pair of PCs: (2, 3) with score: 0.9012
The number of PCs needed for 95% variance is now 161, which means that equalizing per-gene variance scatters signal across many more dimensions. Even after scaling, PCs 2 & 3 remain the optimal discriminators, though accuracy drops modestly (from 0.945 to 0.9012).
DropSeq¶
DropSeq MCF7¶
The non-scaled analysis:
data = ds_mcf7_norm
X = data.T # cells × genes
print(X.shape)
adata_ds_mcf7 = ad.AnnData(X)
adata_ds_mcf7.obs_names = data.columns # adding cell names
adata_ds_mcf7.var_names = data.index # adding gene names
add_labels(adata_ds_mcf7) # adding condition labels
sc.pp.pca(adata_ds_mcf7, n_comps=800, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ds_mcf7, n_pcs=800, plot=True)
# plot PCA
best_split(adata_ds_mcf7, label_key='condition', max_pc=10)
(21626, 3000)
Number of PCs needed to explain 95% variance: 761
PC1 explains: 24.95%
PC2 explains: 8.63%
PC3 explains: 4.43%
PC4 explains: 2.63%
PC5 explains: 2.13%
PC6 explains: 1.42%
PC7 explains: 1.30%
PC8 explains: 1.02%
PC9 explains: 0.92%
PC10 explains: 0.85%
Best pair of PCs: (2, 3) with score: 0.8690
A very large number of components is required (761) because the raw DropSeq matrix is sparse and high-dimensional. PC2 and PC3 together give a decent but noisier separation than SmartSeq, with a lower accuracy of 0.869.
Now the scaled one:
data = ds_mcf7_norm
X_log = np.log1p(data.T) # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log) # scaling the data
adata_ds_mcf7_scaled = ad.AnnData(X_scaled)
adata_ds_mcf7_scaled.obs_names = data.columns # adding cell names
adata_ds_mcf7_scaled.var_names = data.index # adding gene names
add_labels(adata_ds_mcf7_scaled) # adding condition labels
sc.pp.pca(adata_ds_mcf7_scaled, n_comps=2999, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ds_mcf7_scaled, n_pcs=2999)
# plot PCA
best_split(adata_ds_mcf7_scaled, label_key='condition', max_pc=10)
Number of PCs needed to explain 95% variance: 2652
PC1 explains: 0.73%
PC2 explains: 0.29%
PC3 explains: 0.17%
PC4 explains: 0.12%
PC5 explains: 0.09%
PC6 explains: 0.09%
PC7 explains: 0.08%
PC8 explains: 0.08%
PC9 explains: 0.07%
PC10 explains: 0.07%
Best pair of PCs: (1, 3) with score: 0.9681
Unit-variance scaling massively flattens the variance curve, so almost every PC contributes a tiny amount and 2652 PCs are needed to explain 95% of the variance. Separability, however, improves slightly after scaling, likely because it down-weights very high-variance (noisy) genes: the best 2-PC split is now PC1 & PC3, with accuracy 0.9681. In this run PC1 carries much of the hypoxia signal, so pairing it with PC3 gives a cleaner separation than PC2 did.
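The effect described above, where unit-variance scaling flattens the PCA spectrum so that far more PCs are needed for the same cumulative variance, can be reproduced on synthetic data (a toy matrix, not the project data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 50 "genes" whose standard deviations span three orders of magnitude,
# mimicking the skewed per-gene variances of raw expression data
scales = 10.0 ** rng.uniform(-1, 2, size=50)
X = rng.normal(size=(200, 50)) * scales

def n_pcs_for_95(evr):
    """Smallest number of PCs whose cumulative variance ratio reaches 95%."""
    return int(np.searchsorted(np.cumsum(evr), 0.95) + 1)

raw_evr = PCA().fit(X).explained_variance_ratio_
scaled_evr = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_
# after scaling, variance spreads across many more components
print(n_pcs_for_95(raw_evr), n_pcs_for_95(scaled_evr))
```

The unscaled spectrum is dominated by the few high-variance genes; scaling equalizes them and pushes the 95% threshold far down the component list, just as observed for the Drop-seq matrices.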
DropSeq HCC1806¶
Last set of cells, we begin with non-scaled PCA:
data = ds_hcc_norm
X = data.T # cells × genes
adata_ds_hcc = ad.AnnData(X)
adata_ds_hcc.obs_names = data.columns # adding cell names
adata_ds_hcc.var_names = data.index # adding gene names
add_labels(adata_ds_hcc) # adding condition labels
sc.pp.pca(adata_ds_hcc, n_comps=900, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ds_hcc, n_pcs=900, plot=True)
# plot PCA
best_split(adata_ds_hcc, label_key='condition', max_pc=10)
Number of PCs needed to explain 95% variance: 844
PC1 explains: 8.42%
PC2 explains: 6.32%
PC3 explains: 3.88%
PC4 explains: 2.87%
PC5 explains: 2.38%
PC6 explains: 2.16%
PC7 explains: 1.67%
PC8 explains: 1.48%
PC9 explains: 1.44%
PC10 explains: 1.25%
Best pair of PCs: (5, 6) with score: 0.8087
The raw data require 844 components to reach 95% cumulative variance, reflecting widespread noise. The best 2-PC split reaches 0.8087 accuracy: PCs 5 & 6 yield the clearest linear separation without scaling, showing that the hypoxia signal sits in subtler, higher-order components.
Continue with scaled:
data = ds_hcc_norm
X_log = np.log1p(data.T) # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log) # scaling the data
adata_ds_hcc_scaled = ad.AnnData(X_scaled)
adata_ds_hcc_scaled.obs_names = data.columns # adding cell names
adata_ds_hcc_scaled.var_names = data.index # adding gene names
add_labels(adata_ds_hcc_scaled) # adding condition labels
sc.pp.pca(adata_ds_hcc_scaled, n_comps=2999, svd_solver='arpack', use_highly_variable=False)
get_top_components(adata_ds_hcc_scaled, n_pcs=2999, plot=True)
# plot PCA
best_split(adata_ds_hcc_scaled, label_key='condition', max_pc=10)
Number of PCs needed to explain 95% variance: 2595
PC1 explains: 0.39%
PC2 explains: 0.27%
PC3 explains: 0.17%
PC4 explains: 0.15%
PC5 explains: 0.09%
PC6 explains: 0.09%
PC7 explains: 0.08%
PC8 explains: 0.08%
PC9 explains: 0.08%
PC10 explains: 0.08%
Best pair of PCs: (3, 4) with score: 0.8932
Unit-variance scaling flattens the variance curve almost completely, so nearly every PC contributes a sliver. This re-weighting sharpens separation (≈89% accuracy) compared to the unscaled case, giving the best 2-PC split with PC3 & PC4.
DropSeq issue¶
When we ran PCA on the full (Drop-seq) dataset and asked for enough components to cover 95% of the variance, the algorithm returned thousands of PCs. In practice, those later dimensions:
- Capture very low signal-to-noise ratios (technical noise, drop-out events)
- Tend to drown out the biologically meaningful structure when fed into UMAP or t-SNE
- Greatly increase computation time and destabilize embeddings
Instead of a blanket “95% variance” cutoff, we’ll now use the scree plot (per-PC variance curve) to pick the elbow point—the PC after which each additional axis contributes only vanishing gains. This approach:
- Denoises by discarding the long tail of tiny, likely artifactual components
- Speeds up UMAP/t-SNE and yields more reproducible layouts
- Focuses on the axes that capture true biological variation (cell-state differences, treatment effects)
Below we generate the scree & cumulative variance plots, detect the elbow and run our new PCA before proceeding to UMAP/t-SNE on those top components.
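Elbow detection can also be automated. A common geometric heuristic, sketched below under the assumption of a monotonically decreasing scree curve (the plots that follow still pick the elbow visually), takes the point farthest from the straight line joining the first and last scree values:

```python
import numpy as np

def find_elbow(evr):
    """1-based index of the elbow of a decreasing scree curve: the point
    with the largest perpendicular distance from the chord joining the
    first and last values."""
    y = np.asarray(evr, dtype=float)
    x = np.arange(len(y), dtype=float)
    p1 = np.array([x[0], y[0]])
    p2 = np.array([x[-1], y[-1]])
    d = (p2 - p1) / np.linalg.norm(p2 - p1)   # unit vector along the chord
    vecs = np.column_stack([x, y]) - p1
    dist = np.abs(vecs[:, 0] * d[1] - vecs[:, 1] * d[0])  # |cross product|
    return int(np.argmax(dist)) + 1

# synthetic scree curve with an obvious elbow after the third component
print(find_elbow([5, 4, 3, 0.5, 0.4, 0.3, 0.2, 0.1]))  # → 4
```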
sc.tl.pca(adata_ds_mcf7_scaled, n_comps=200, svd_solver='full')
evr = adata_ds_mcf7_scaled.uns['pca']['variance_ratio'] # shape
cumvar = np.cumsum(evr) # cumulative sum
pcs = np.arange(1, len(evr) + 1) # [1,2,3,...]
fig, ax1 = plt.subplots(figsize=(6,4))
ax1.plot(pcs, evr, '-o', label='per‐PC var.')
ax1.set_xlabel('PC number'); ax1.set_ylabel('Explained variance ratio')
ax1.axvline(20, color='gray', linestyle='--', alpha=0.5)
ax2 = ax1.twinx()
ax2.plot(pcs, cumvar, '-s', c='C1', label='cumulative var.')
ax2.set_ylabel('Cumulative variance')
ax1.legend(loc='upper left'); ax2.legend(loc='lower right')
plt.title('Scree & cumulative variance')
plt.show()
sc.tl.pca(adata_ds_hcc_scaled, n_comps=200, svd_solver='full')
evr = adata_ds_hcc_scaled.uns['pca']['variance_ratio'] # shape
cumvar = np.cumsum(evr) # cumulative sum
pcs = np.arange(1, len(evr) + 1) # [1,2,3,...]
fig, ax1 = plt.subplots(figsize=(6,4))
ax1.plot(pcs, evr, '-o', label='per‐PC var.')
ax1.set_xlabel('PC number'); ax1.set_ylabel('Explained variance ratio')
ax1.axvline(20, color='gray', linestyle='--', alpha=0.5)
ax2 = ax1.twinx()
ax2.plot(pcs, cumvar, '-s', c='C1', label='cumulative var.')
ax2.set_ylabel('Cumulative variance')
ax1.legend(loc='upper left'); ax2.legend(loc='lower right')
plt.title('Scree & cumulative variance')
plt.show()
From our scree‐plots:
- MCF-7: the per-PC variance curve flattens out around PC 4–5
- HCC1806: the elbow appears near PC 5–6
To remain conservative (i.e. not risk dropping any real biological signal) and keep our workflow consistent across both datasets, we will use 10 PCs for all downstream steps (UMAP, t-SNE, clustering). This ensures:
- Coverage of all clearly informative axes (including a small safety buffer beyond the elbow)
- Robustness against dataset-specific noise peaks
- Cohesion in parameter choice across MCF-7 and HCC1806
Below, we re-run PCA with n_comps=10 and proceed to UMAP/t-SNE on those top 10 components.
data = ds_mcf7_norm
X_log = np.log1p(data.T) # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log) # scaling the data
adata_ds_mcf7_scaled = ad.AnnData(X_scaled)
adata_ds_mcf7_scaled.obs_names = data.columns # adding cell names
adata_ds_mcf7_scaled.var_names = data.index # adding gene names
add_labels(adata_ds_mcf7_scaled) # adding condition labels
sc.pp.pca(adata_ds_mcf7_scaled, n_comps=10, svd_solver='arpack', use_highly_variable=False)
data = ds_hcc_norm
X_log = np.log1p(data.T) # cells × genes
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log) # scaling the data
adata_ds_hcc_scaled = ad.AnnData(X_scaled)
adata_ds_hcc_scaled.obs_names = data.columns # adding cell names
adata_ds_hcc_scaled.var_names = data.index # adding gene names
add_labels(adata_ds_hcc_scaled) # adding condition labels
sc.pp.pca(adata_ds_hcc_scaled, n_comps=10, svd_solver='arpack', use_highly_variable=False)
Results summary¶
Empirically, we compared PCA with and without scaling across four dataset‐condition combinations (SmartSeq vs. DropSeq, MCF7 vs. HCC1806) and found that, despite concerns over unit‐variance scaling amplifying technical noise, the unscaled PCA still captures hypoxia‐related variation robustly in its leading components. Overall, scaling tends to:
- Spread variance more evenly across many components (e.g. SmartSeq MCF7 rises from 20 → 204 PCs for 95% variance).
- Shift the optimal separation axes (e.g. MCF7 switches from PC(1,6) unscaled to PC(1,2) scaled).
- Modestly affect classification accuracy: in the noisier DropSeq datasets, scaling actually improves separability (MCF7: 0.87 → 0.97; HCC1806: 0.81 → 0.89), whereas SmartSeq MCF7 stays near-perfect (0.99) and SmartSeq HCC1806 dips slightly (0.945 → 0.901).
summary_df = pd.DataFrame([
    {"Dataset": "SmartSeq MCF7", "Scaling": "Unscaled", "PCs for 95% var": 20, "Best PC pair": "1,6", "Accuracy": 0.992},
    {"Dataset": "SmartSeq MCF7", "Scaling": "Scaled", "PCs for 95% var": 204, "Best PC pair": "1,2", "Accuracy": 0.992},
    {"Dataset": "SmartSeq HCC1806", "Scaling": "Unscaled", "PCs for 95% var": 34, "Best PC pair": "2,3", "Accuracy": 0.945},
    {"Dataset": "SmartSeq HCC1806", "Scaling": "Scaled", "PCs for 95% var": 161, "Best PC pair": "2,3", "Accuracy": 0.901},
    {"Dataset": "DropSeq MCF7", "Scaling": "Unscaled", "PCs for 95% var": 761, "Best PC pair": "2,3", "Accuracy": 0.869},
    {"Dataset": "DropSeq MCF7", "Scaling": "Scaled", "PCs for 95% var": 2652, "Best PC pair": "1,3", "Accuracy": 0.968},
    {"Dataset": "DropSeq HCC1806", "Scaling": "Unscaled", "PCs for 95% var": 844, "Best PC pair": "5,6", "Accuracy": 0.809},
    {"Dataset": "DropSeq HCC1806", "Scaling": "Scaled", "PCs for 95% var": 2595, "Best PC pair": "3,4", "Accuracy": 0.893},
])
summary_df
| | Dataset | Scaling | PCs for 95% var | Best PC pair | Accuracy |
|---|---|---|---|---|---|
| 0 | SmartSeq MCF7 | Unscaled | 20 | 1,6 | 0.992 |
| 1 | SmartSeq MCF7 | Scaled | 204 | 1,2 | 0.992 |
| 2 | SmartSeq HCC1806 | Unscaled | 34 | 2,3 | 0.945 |
| 3 | SmartSeq HCC1806 | Scaled | 161 | 2,3 | 0.901 |
| 4 | DropSeq MCF7 | Unscaled | 761 | 2,3 | 0.869 |
| 5 | DropSeq MCF7 | Scaled | 2652 | 1,3 | 0.968 |
| 6 | DropSeq HCC1806 | Unscaled | 844 | 5,6 | 0.809 |
| 7 | DropSeq HCC1806 | Scaled | 2595 | 3,4 | 0.893 |
Across all conditions, the hypoxia signature consistently emerges in the first handful of unscaled PCs, even in sparse DropSeq, confirming that PCA alone can recover our phenotype without elaborate normalization. Scaling can unearth subtler axes in noisy data, but at the cost of inflating minor technical variance. Given the DropSeq issue described above, we restrict those two datasets to 10 PCs for the downstream analysis.
Data & Parameter Choices for Downstream Analysis
- Smart-seq: we proceed unscaled (no zero-centering/unit-variance), since this data was already normalized and yields robust results without additional scaling.
- Drop-seq: we use StandardScaler (zero-centering and unit-variance scaling) followed by PCA with n_comps=10, as described above, to denoise and stabilize our embeddings.
K-NN graph¶
Building a k-NN graph is a necessary preprocessing step for graph-based dimensionality reduction methods like UMAP and t-SNE, which leverage local cell neighborhoods.
We use Scanpy’s pp.neighbors() function and focus on three key parameters:
- n_pcs: number of principal components from PCA used as the embedding space for neighbor calculations. We will not pass it directly; instead we access the X_pca_95 obsm stored in the adata object for the Smart-seq datasets, and X_pca for the Drop-seq ones.
- n_neighbors: number of nearest neighbors per cell to include in the graph. A common heuristic is to set
$$n\_neighbors \approx \sqrt{N},$$
where $N$ is the total number of cells in the dataset.
- metric: distance metric for computing pairwise cell distances in PC space. Euclidean is the standard choice and the one we will use in our implementation.
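The sqrt-N heuristic translates into concrete k values for our four datasets (using the cell counts reported in the shape printout below):

```python
import numpy as np

def heuristic_n_neighbors(n_cells, k_min=2):
    """sqrt-N rule of thumb for the k-NN graph size, floored at a small minimum."""
    return max(k_min, int(np.sqrt(n_cells)))

# cell counts of the four datasets (see the shape printout in this section)
for name, n_cells in [("ss_mcf7", 250), ("ss_hcc", 182),
                      ("ds_mcf7", 21626), ("ds_hcc", 14682)]:
    print(name, heuristic_n_neighbors(n_cells))
```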
functions¶
# printing the shape of the datasets to look at the number of cells and genes
print(ds_hcc_norm.shape,
ds_mcf7_norm.shape,
ss_hcc_norm.shape,
ss_mcf7_norm.shape)
(3000, 14682) (3000, 21626) (3000, 182) (3000, 250)
def build_and_diagnose_knn(
    adata,
    n_neighbors,
    metric="euclidean",
    use_rep="X_pca_95",
    random_state=42
):
    """
    Build a k-NN graph on the representation stored in adata.obsm[use_rep].

    Parameters
    ----------
    adata : AnnData
        Must have the chosen representation in adata.obsm[use_rep].
    n_neighbors : int
        k for the k-NN graph.
    metric : str
        Distance metric for k-NN.
    use_rep : str
        Key in adata.obsm to build the graph on.
    random_state : int
        Seed for reproducibility.

    Returns
    -------
    None
        sc.pp.neighbors stores the graph in adata.obsp['distances'] and
        adata.obsp['connectivities'], plus metadata in adata.uns['neighbors'].
    """
    sc.pp.neighbors(
        adata,
        n_neighbors=n_neighbors,
        metric=metric,          # was hard-coded to 'euclidean', ignoring the parameter
        method='umap',
        knn=True,
        use_rep=use_rep,
        random_state=random_state
    )
MCF7 Smart Seq¶
build_and_diagnose_knn(
adata_ss_mcf7,
n_neighbors=int(np.sqrt(250)), # sqrt(250) = 15.81
metric="euclidean",
random_state=42
)
HCC1806 Smart Seq¶
build_and_diagnose_knn(
adata_ss_hcc,
n_neighbors=int(np.sqrt(ss_hcc_norm.shape[1])), # sqrt(182) = 13.49
metric="euclidean",
random_state=42
)
MCF7 Drop Seq¶
build_and_diagnose_knn(
adata_ds_mcf7_scaled,
n_neighbors=int(np.sqrt(ds_mcf7_norm.shape[1])), # sqrt(21626) = 147.0
metric="euclidean",
use_rep="X_pca",
random_state=42
)
HCC1806 Drop Seq¶
build_and_diagnose_knn(
adata_ds_hcc_scaled,
n_neighbors=int(np.sqrt(ds_hcc_norm.shape[1])), # sqrt(14682) = 121.2
use_rep="X_pca",
metric="euclidean",
random_state=42
)
t-SNE¶
t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction method. It is used only for visualization, not for training models. It projects high-dimensional data (e.g. 3000 genes) into 2 or 3 dimensions while preserving local structure — meaning that similar cells stay close together. It works by converting pairwise similarities into probabilities and minimizing the Kullback–Leibler divergence between high and low-dimensional distributions.
In our experiments, t-SNE reveals strong separation in datasets, with hypoxic and normoxic cells forming distinct clusters.
t-SNE’s effectiveness is highly sensitive to its parameters. Perplexity controls the balance between local and global aspects of the data (similar to the number of nearest neighbors considered), and different values can yield drastically different embeddings. As a standard choice we used 30, but we then tried different values for each dataset in order to get the best visual split.
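Internally, perplexity fixes the effective neighborhood size: for each cell, t-SNE binary-searches a Gaussian bandwidth sigma so that the conditional distribution over the other cells has perplexity 2^H(P) equal to the chosen value. A minimal numpy sketch of that search (illustrative only, not the scanpy implementation):

```python
import numpy as np

def neighbor_probs(d2, sigma):
    """Conditional neighbour distribution for squared distances d2."""
    p = np.exp(-d2 / (2.0 * sigma ** 2))
    return p / p.sum()

def sigma_for_perplexity(d2, perplexity, n_iter=100, tol=1e-5):
    """Bisect the Gaussian bandwidth until 2**entropy matches the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = neighbor_probs(d2, sigma)
        perp = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))
        if abs(perp - perplexity) < tol:
            break
        if perp > perplexity:   # neighbourhood too broad: shrink sigma
            hi = sigma
        else:                   # too narrow: grow sigma
            lo = sigma
    return sigma

rng = np.random.default_rng(0)
d2 = rng.uniform(0.1, 5.0, size=200)          # squared distances to 200 cells
sigma = sigma_for_perplexity(d2, 30.0)
p = neighbor_probs(d2, sigma)
print(round(2.0 ** (-np.sum(p * np.log2(p))), 1))  # → 30.0, the requested perplexity
```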
functions¶
def t_sne(adata, title="", perplexity=30, use_rep="X_pca_95", random_state=42):
    """
    Run t-SNE on the PCA-reduced data and plot the results.

    Parameters
    ----------
    adata : AnnData
        The AnnData object containing PCA-reduced data.
    title : str
        Title for the plot.
    perplexity : int
        Perplexity parameter for t-SNE.
    use_rep : str
        Key in adata.obsm to run t-SNE on.
    random_state : int
        Seed for reproducibility.

    Returns
    -------
    adata : AnnData
        The AnnData object with t-SNE coordinates added.
    """
    sc.tl.tsne(
        adata,
        use_rep=use_rep,
        perplexity=perplexity,
        n_pcs=adata.uns["pca"]["n_components_95"] if use_rep == "X_pca_95" else None,
        random_state=random_state)  # was hard-coded to 42, ignoring the parameter
    sc.pl.tsne(
        adata,
        color='condition',
        show=False,
        size=10,
        title=f"t-SNE: {title}"
    )
    return adata
MCF7 Smart Seq¶
The t-SNE projection of the raw (unscaled) MCF7 Smart-Seq dataset reveals two distinct regions in the embedding space, corresponding to cells under hypoxia and normoxia conditions. Notably, two cells appear closer to the normoxia region, deviating slightly from the expected separation.
t_sne(adata_ss_mcf7, title="MCF7 Smart Seq unscaled", perplexity=50)
AnnData object with n_obs × n_vars = 250 × 3000
obs: 'condition'
uns: 'pca', 'neighbors', 'tsne', 'condition_colors'
obsm: 'X_pca', 'X_pca_95', 'X_tsne'
varm: 'PCs'
obsp: 'distances', 'connectivities'
HCC1806 Smart Seq¶
Also for HCC, two regions seem to emerge, though they are not as distinct or well-defined as in the other cell line. Notably, we had to manually adjust the perplexity parameter to 20 (instead of the standard 30) to achieve a visually interpretable result.
t_sne(adata_ss_hcc, title="HCC1806 Smart Seq unscaled", perplexity=20)
AnnData object with n_obs × n_vars = 182 × 3000
obs: 'condition'
uns: 'pca', 'neighbors', 'tsne', 'condition_colors'
obsm: 'X_pca', 'X_pca_95', 'X_tsne'
varm: 'PCs'
obsp: 'distances', 'connectivities'
MCF7 Drop-Seq¶
The effect of preprocessing (log-transforming and scaling the dataset) and reducing the number of principal components to just 10 revealed two distinct subgroups in the t-SNE embedding, corresponding to the biological conditions. This improvement is due to t-SNE's sensitivity to the number of input dimensions; using more than 1000 dimensions would result in a noisy and uninterpretable plot.
t_sne(adata_ds_mcf7_scaled, title="MCF7 Drop Seq scaled - using only 10 components", perplexity=50, use_rep="X_pca", random_state=42)
AnnData object with n_obs × n_vars = 21626 × 3000
obs: 'condition'
uns: 'pca', 'neighbors', 'tsne', 'condition_colors'
obsm: 'X_pca', 'X_tsne'
varm: 'PCs'
obsp: 'distances', 'connectivities'
HCC Drop-Seq¶
The t-SNE embedding of the scaled HCC Drop-Seq dataset reveals two distinct subgroups corresponding to the biological conditions. For the same reason as before, this separation is achieved only after preprocessing the data (log-transforming and scaling) and reducing the number of principal components to 10.
t_sne(adata_ds_hcc_scaled, title="HCC1806 Drop Seq scaled", perplexity=50, use_rep="X_pca")
AnnData object with n_obs × n_vars = 14682 × 3000
obs: 'condition'
uns: 'pca', 'neighbors', 'tsne', 'condition_colors'
obsm: 'X_pca', 'X_tsne'
varm: 'PCs'
obsp: 'distances', 'connectivities'
UMAP¶
UMAP (Uniform Manifold Approximation and Projection) is a powerful non-linear dimensionality reduction technique that captures both local neighborhood structure and global data topology. In our single-cell analysis, we use UMAP to embed high-dimensional gene expression profiles into two dimensions, making it easier to visualize and interpret cell clusters.
To compute the UMAP embedding with the scanpy function sc.tl.umap, we first need the k-NN graph built earlier with sc.pp.neighbors, whose n_neighbors parameter defines how many neighbors each cell considers when the graph is constructed. This cell is meant to be run after the K-NN one.
The min_dist parameter controls how tightly UMAP packs points together in the low-dimensional space. The default value of 0.5 provides a good representation of the conditions in the embedded space, so we kept this standard choice.
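Under the hood, min_dist (together with spread) is translated into the parameters a and b of UMAP's low-dimensional similarity kernel 1/(1 + a*d^(2b)) by a least-squares fit. The sketch below mirrors umap-learn's find_ab_params and shows that a larger min_dist yields a flatter kernel (smaller a):

```python
import numpy as np
from scipy.optimize import curve_fit

def find_ab(min_dist, spread=1.0):
    """Fit 1 / (1 + a * x**(2b)) to the piecewise target curve that is ~1
    below min_dist and decays exponentially beyond it (as umap-learn does)."""
    x = np.linspace(0.0, spread * 3.0, 300)
    y = np.where(x < min_dist, 1.0, np.exp(-(x - min_dist) / spread))
    (a, b), _ = curve_fit(lambda x, a, b: 1.0 / (1.0 + a * x ** (2.0 * b)), x, y)
    return a, b

a_default, b_default = find_ab(min_dist=0.1)   # umap-learn defaults: a ≈ 1.58, b ≈ 0.90
a_loose, _ = find_ab(min_dist=0.5)             # the value used in this notebook
print(round(a_default, 2), round(b_default, 2), a_loose < a_default)
```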
MCF7 SmartSeq¶
The UMAP embedding closely mirrors the patterns observed in the t-SNE analysis. Using the unscaled data, there is a near-perfect separation into two distinct regions, except for two points, which aligns exactly with the t-SNE results.
sc.tl.umap(
adata_ss_mcf7,
min_dist=0.5,
random_state=42
)
sc.pl.umap(
adata_ss_mcf7,
color='condition',
show=False,
size=20,
title="UMAP: MCF7 Smart-seq unscaled"
)
<Axes: title={'center': 'UMAP: MCF7 Smart-seq unscaled'}, xlabel='UMAP1', ylabel='UMAP2'>
HCC SmartSeq¶
Consistent with the PCA and t-SNE results, the visual separation is less pronounced compared to the MCF7 cell line. However, a clear gradient is still observable, indicating some level of differentiation between conditions.
sc.tl.umap(
adata_ss_hcc,
min_dist=0.5,
random_state=42
)
sc.pl.umap(
adata_ss_hcc,
color='condition',
show=False,
size=20,
title="UMAP: HCC1806 Smart-seq unscaled"
)
<Axes: title={'center': 'UMAP: HCC1806 Smart-seq unscaled'}, xlabel='UMAP1', ylabel='UMAP2'>
MCF7 DropSeq¶
UMAP embedding reveals interesting structures in the preprocessed version of the data. Notably, UMAP has a superior ability to preserve the global and local structure of the data, even when working with a high number of components. To maintain consistency with the neighbors graph constructed earlier, the analysis here is still based on the first 10 components.
#scaled version
sc.tl.umap(
adata_ds_mcf7_scaled,
min_dist=0.5,
random_state=42
)
sc.pl.umap(
adata_ds_mcf7_scaled,
color='condition',
show=False,
size=10,
title="UMAP: MCF7 Drop-seq scaled"
)
<Axes: title={'center': 'UMAP: MCF7 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>
HCC DropSeq¶
For this cell line, the interpretability of the plot is less clear compared to MCF7. However, it appears that hypoxic cells are positioned above and below a stripe of cells labeled as normoxic, suggesting some level of separation between the conditions.
# scaled version
sc.tl.umap(
adata_ds_hcc_scaled,
min_dist=0.5,
random_state=42
)
sc.pl.umap(
adata_ds_hcc_scaled,
color='condition',
show=False,
size=10,
title="UMAP: HCC1806 Drop-seq scaled"
)
<Axes: title={'center': 'UMAP: HCC1806 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>
K-Means Clustering¶
To choose an appropriate number of clusters (k), we applied K-means to the PCA embedding that captures 95% of the variance. We evaluated both inertia and average silhouette scores for a range of k values. In most cases the silhouette peaks at k = 2, which is what we would expect. Where other values looked plausible we plotted them as well. Finally, we evaluate clusters against ground-truth labels using metrics such as the ARI, NMI and cluster purity.
When we project those two clusters back into our principal components (PC1 vs. PC2, etc.), and into UMAP or t-SNE space, the resulting partition cleanly separates hypoxic from normoxic cells only in the MCF-7 SmartSeq dataset. In the other cell-line/protocol combinations, the clusters overlap substantially and fail to track our desired condition, or more than 2 clusters are needed to separate hypoxia and normoxia.
This uniquely clear split in the MCF-7 SmartSeq data likely reflects a combination of cell-line consistency under hypoxia and the high sensitivity of the SmartSeq protocol in capturing those changes.
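As a quick illustration of the external metrics used below, here is a toy evaluation (synthetic labels, not project data) where one hypoxic cell lands in the wrong cluster; purity is the fraction of cells belonging to the majority class of their cluster:

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true = ["Hypo"] * 5 + ["Norm"] * 5
pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # one hypoxic cell mis-clustered

ct = pd.crosstab(pd.Series(true, name="True"), pd.Series(pred, name="Cluster"))
purity = ct.max(axis=0).sum() / ct.values.sum()   # majority class per cluster
print(adjusted_rand_score(true, pred),
      normalized_mutual_info_score(true, pred),
      purity)   # purity = 9/10 = 0.9
```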
helper functions¶
This cell defines four key helper functions for performing and visualizing K-Means clustering on single-cell data, typically after dimensionality reduction (e.g., PCA):
kmeans_optimization
Purpose: runs K-Means clustering on a chosen data representation rep_key (e.g., the first N principal components) for a range of cluster numbers (k_range), storing results under the identifier visual.
What it does:
- Fits K-Means for each value of k in the specified range.
- Computes and stores the inertia (within-cluster sum of squares) and silhouette score (a measure of cluster separation) for each k.
- Identifies the best k (the one with the highest silhouette score).
- Stores the clustering results and labels in the AnnData object.
- Plots the inertia ("elbow plot") and silhouette scores to help visually select the optimal number of clusters.

silhouette_diagrams
Purpose: visualizes the quality of clustering for different values of k using silhouette plots.
What it does:
- For each k, computes the silhouette coefficient for every cell (how well each cell fits within its cluster).
- Plots the silhouette diagram for each k, showing the distribution of silhouette scores per cluster.
- Helps assess which k yields the most coherent and well-separated clusters.

plot_kmeans_clusters
Purpose: visualizes the clustering results in low-dimensional space (UMAP, t-SNE, or PCA).
What it does:
- Plots the K-Means clustering for a user-specified k side-by-side with the ground-truth biological condition (hypoxia/normoxia).
- Supports visualization in UMAP, t-SNE, or PCA space.
- Allows direct comparison between unsupervised clusters and known biological labels.

evaluate_clustering
Purpose: evaluates clustering results against ground-truth labels using external metrics and contingency tables.
What it does:
- Computes Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) to quantify clustering quality.
- Generates a contingency table showing the overlap between true labels and cluster assignments.
- Calculates row-wise and column-wise percentages for the contingency table.
- Computes purity per cluster and overall purity as additional metrics.
- Prints all results for easy interpretation.
# scikit-learn and matplotlib helpers used by the clustering functions below
# (not part of the shared imports cell at the top of the notebook)
from matplotlib.ticker import FixedFormatter, FixedLocator
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_samples,
    silhouette_score,
)

def kmeans_optimization(
    adata,
    rep_key: str = 'X_pca_95',
    visual: str = 'pca',
    k_range: range = range(2, 10),
    random_state: int = 42
):
"""
Optimize KMeans on a given embedding, store results in adata.uns and adata.obs.
Parameters
----------
adata: AnnData
Must contain adata.obsm[rep_key] for clustering.
rep_key: str
Key in adata.obsm to cluster on (e.g. 'X_pca', 'X_pca_95', 'X_umap').
visual: str
Identifier under which results will be stored in adata.uns['kmeans'].
k_range: range
Range of k values to evaluate (n_clusters).
random_state: int
Random seed for reproducibility.
Effects
-------
- Populates adata.uns['kmeans'][visual] = {
'k_range': list(k_range),
'inertia': [...],
'silhouette': [...],
'best_k': int,
'best_score': float,
'labels_key': str
}
- Stores best KMeans labels in adata.obs under key provided by 'labels_key'.
"""
# prepare storage
if 'kmeans' not in adata.uns:
adata.uns['kmeans'] = {}
results = {}
# extract data for clustering
X = adata.obsm.get(rep_key)
if X is None:
raise KeyError(f"adata.obsm['{rep_key}'] not found")
inertias = []
silhouettes = []
models = []
# fit and evaluate
for k in k_range:
model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
inertias.append(model.inertia_)
if k > 1:
silhouettes.append(silhouette_score(X, model.labels_))
else:
silhouettes.append(np.nan)
models.append(model)
# determine best k by silhouette
silhouettes_np = np.array(silhouettes)
# ignore first nan
best_idx = np.nanargmax(silhouettes_np)
best_k = k_range[best_idx]
best_score = silhouettes_np[best_idx]
best_labels = models[best_idx].labels_.astype(str)
labels_key = f'kmeans_{visual}'
# store in adata
results['k_range'] = list(k_range)
results['inertia'] = inertias
results['silhouette'] = silhouettes
results['best_k'] = int(best_k)
results['best_score'] = float(best_score)
results['labels_key'] = labels_key
adata.uns['kmeans'][visual] = results
adata.obs[labels_key] = best_labels
# plot inertia & silhouette
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
axs[0].plot(list(k_range), inertias, 'bo-')
axs[0].set_xlabel('k'), axs[0].set_ylabel('Inertia'), axs[0].set_title(f'Elbow Plot ({visual})')
axs[1].plot(list(k_range), silhouettes, 'bo-')
axs[1].set_xlabel('k'), axs[1].set_ylabel('Silhouette'), axs[1].set_title(f'Silhouette Scores ({visual})')
plt.tight_layout()
plt.show()
return best_k
def silhouette_diagrams(
adata,
rep_key: str = 'X_pca_95',
visual: str = 'pca',
k_range: range = range(2, 7),
dataset_name: str = 'Dataset'
):
"""
Plot silhouette diagrams for KMeans clusters on a given embedding.
Parameters
----------
adata: AnnData
Must contain adata.obsm[rep_key].
rep_key: str
Key in adata.obsm for clustering.
visual: str
Identifier used for titles and result storage.
k_range: range
Values of k to evaluate.
dataset_name: str
Name used in subplot titles.
Effects
-------
- Stores silhouette_scores dict in adata.uns['kmeans'][visual]['silhouette_details']
"""
X = adata.obsm.get(rep_key)
if X is None:
raise KeyError(f"adata.obsm['{rep_key}'] not found")
k_list = list(k_range)
models = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X) for k in k_list]
sil_scores = [silhouette_score(X, m.labels_) for m in models]
# Prepare plots
n_plots = len(k_list)
n_cols = 3
n_rows = math.ceil(n_plots / n_cols)
plt.figure(figsize=(5 * n_cols, 4 * n_rows))
details = {}
for idx, (k, model) in enumerate(zip(k_list, models)):
ax = plt.subplot(n_rows, n_cols, idx + 1)
labels = model.labels_
coeffs = silhouette_samples(X, labels)
details[k] = coeffs
padding = len(X) // 30
pos = padding
ticks = []
for cluster in range(k):
c_vals = np.sort(coeffs[labels == cluster])
ax.fill_betweenx(
np.arange(pos, pos + len(c_vals)),
0, c_vals, alpha=0.7
)
ticks.append(pos + len(c_vals)/2)
pos += len(c_vals) + padding
ax.yaxis.set_major_locator(FixedLocator(ticks))
ax.yaxis.set_major_formatter(FixedFormatter([str(c) for c in range(k)]))  # FixedFormatter expects strings
ax.axvline(x=sil_scores[idx], color='red', linestyle='--')
ax.set_title(f"{dataset_name} — k={k}")
if idx % n_cols == 0:
ax.set_ylabel('Cluster')
if idx >= (n_rows-1)*n_cols:
ax.set_xlabel('Silhouette Coefficient')
else:
plt.setp(ax.get_xticklabels(), visible=False)
plt.tight_layout()
plt.show()
# store details
adata.uns['kmeans'][visual]['silhouette_details'] = details
return dict(zip(k_list, sil_scores))
def plot_kmeans_clusters(
adata,
k: int,
rep_key: str,
embed: str = 'umap',
embed_key: str = "",
pca_dims: tuple = (0, 1),
random_state: int = 42,
size: int = 10,
dataset_name: str = "Dataset",
consistent_colors: bool = True
):
"""
Plot KMeans clusters for a single user-specified k value and ground truth conditions side-by-side on UMAP/TSNE or PCA.
Parameters
----------
adata: AnnData
Annotated data matrix.
k: int
Number of clusters for KMeans.
rep_key: str
Key in adata.obsm to cluster on.
embed: str
Embedding type ('umap', 'tsne', or 'pca').
embed_key: str
Key in adata.obsm for embedding coordinates.
pca_dims: tuple
Dimensions to use for PCA plots.
random_state: int
Random seed for reproducibility.
size: int
Marker size for scatter plots.
dataset_name: str
Name of the dataset to include in plot titles.
consistent_colors: bool
Whether to use consistent colors across plots for clusters.
"""
embed_key = embed_key or f'X_{embed}'
X = adata.obsm.get(rep_key)
if X is None:
raise KeyError(f"adata.obsm['{rep_key}'] not found")
# Run KMeans with user-specified k
model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
labels_key = f'kmeans_k{k}'
adata.obs[labels_key] = model.labels_.astype(str)
if labels_key not in adata.obs:
raise KeyError(f"adata.obs['{labels_key}'] not found")
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
if embed in ('umap', 'tsne'):
coords = adata.obsm.get(embed_key)
if coords is None:
raise KeyError(f"adata.obsm['{embed_key}'] not found")
# left: KMeans clusters
ax = axes[0]
sc.pl.embedding(
adata,
basis=embed,
color=[labels_key],
title=f'{dataset_name} — {embed.upper()} KMeans (k={k})',
show=False,
ax=ax,
size=size,
palette="tab20" if consistent_colors else None,
)
# right: true condition
ax = axes[1]
sc.pl.embedding(
adata,
basis=embed,
color=['condition'],
title=f'{dataset_name} — {embed.upper()} True Condition',
show=False,
ax=ax,
size=size,
)
elif embed == 'pca':
pcs = adata.obsm.get(rep_key)
if pcs is None:
raise KeyError(f"adata.obsm['{rep_key}'] not found for PCA plot")
x, y = pca_dims
# left: KMeans
ax = axes[0]
scatter = ax.scatter(
pcs[:, x], pcs[:, y],
c=adata.obs[labels_key].astype(int),
cmap='Set2' if consistent_colors else 'viridis',
s=size, alpha=0.8
)
ax.set_xlabel(f'PC{x+1}'), ax.set_ylabel(f'PC{y+1}')
ax.set_title(f'{dataset_name} — PCA KMeans (k={k}) — PCs {x+1} vs {y+1}')
handles, _ = scatter.legend_elements()
ax.legend(handles, [f'Cluster {i}' for i in range(k)], title='Cluster')
# right: true condition
ax = axes[1]
scatter = ax.scatter(
pcs[:, x], pcs[:, y],
c=adata.obs['condition'].astype('category').cat.codes,
cmap='Set2', s=size, alpha=0.8
)
ax.set_xlabel(f'PC{x+1}'), ax.set_ylabel(f'PC{y+1}')
ax.set_title(f'{dataset_name} — PCA True Condition')
handles, _ = scatter.legend_elements()
ax.legend(handles, adata.obs['condition'].cat.categories, title='Condition')
else:
raise ValueError("embed must be 'umap', 'tsne', or 'pca'")
plt.tight_layout()
plt.show()
def evaluate_clustering(true_labels, cluster_labels, method_name="Clustering"):
"""
Evaluate clustering against ground-truth labels, including confusion matrix
with both raw counts and row-/column-wise percentages.
Parameters
----------
true_labels : array-like
Ground-truth class labels.
cluster_labels : array-like
Cluster assignments.
method_name : str
Name of the clustering method for printouts.
Returns
-------
results : dict
{
'ARI': float,
'NMI': float,
'contingency': pd.DataFrame,
'row_pct': pd.DataFrame,
'col_pct': pd.DataFrame,
'purity_per_cluster': pd.Series,
'overall_purity': float
}
"""
# External metrics
ari = adjusted_rand_score(true_labels, cluster_labels)
nmi = normalized_mutual_info_score(true_labels, cluster_labels)
# Contingency table
ct = pd.crosstab(
pd.Series(true_labels, name="True"),
pd.Series(cluster_labels, name="Cluster")
)
# Percentages
row_pct = ct.div(ct.sum(axis=1), axis=0) * 100
col_pct = ct.div(ct.sum(axis=0), axis=1) * 100
# Purity calculations
purity_per_cluster = ct.max(axis=0) / ct.sum(axis=0)
# Note: overall purity is computed row-wise (best cluster per true class),
# sometimes called inverse purity; it can differ from the cluster-weighted
# mean of purity_per_cluster when clusters mix classes
overall_purity = ct.values.max(axis=1).sum() / ct.values.sum()
# Print results
print(f"\n=== {method_name} Evaluation ===")
print(f"ARI: {ari:.4f} NMI: {nmi:.4f}\n")
print("Contingency Table (raw counts):")
print(ct, "\n")
print("Row-wise percentages (each true class → clusters):")
print(row_pct.round(1).astype(str) + "%", "\n")
print("Column-wise percentages (each cluster ← true classes):")
print(col_pct.round(1).astype(str) + "%", "\n")
print("Purity per cluster:")
print(purity_per_cluster.to_frame(name="Purity"), "\n")
print(f"Overall purity: {overall_purity:.4f}\n")
# Return everything for further inspection
return {
'ARI': ari,
'NMI': nmi,
'contingency': ct,
'row_pct': row_pct,
'col_pct': col_pct,
'purity_per_cluster': purity_per_cluster,
'overall_purity': overall_purity
}
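To make the purity definitions above concrete, here is a toy contingency table worked through with pandas (the numbers are illustrative only, not from our data); the overall purity follows the row-wise convention used in evaluate_clustering:

```python
import pandas as pd

# Toy contingency table (illustrative): rows = true classes, columns = clusters
ct = pd.DataFrame(
    [[90, 10],   # Hypo
     [5, 95]],   # Norm
    index=pd.Index(["Hypo", "Norm"], name="True"),
    columns=pd.Index([0, 1], name="Cluster"),
)

# Per-cluster purity: fraction of the dominant true class within each cluster
purity_per_cluster = ct.max(axis=0) / ct.sum(axis=0)

# Overall purity as computed in evaluate_clustering above:
# row-wise maxima (best cluster per true class) over the grand total
overall = ct.values.max(axis=1).sum() / ct.values.sum()

print(round(float(purity_per_cluster[0]), 4), round(float(purity_per_cluster[1]), 4))  # 0.9474 0.9048
print(float(overall))  # 0.925
```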
MCF7 Smart Seq¶
The plots clearly show that the silhouette score peaks at two clusters. The silhouette diagrams further confirm this, displaying well-balanced cluster sizes with few negative silhouette values, indicating a strong and consistent clustering structure.
While an elbow might be observed at higher values of k, there is no need to consider them: the cluster-to-condition comparison for k = 2 across PCA, UMAP, and t-SNE spaces already shows an almost perfect match for both the scaled and unscaled versions.
With k = 2, K-Means clustering achieved an overall purity of 0.9720, an almost perfect result.
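The kmeans_optimization and silhouette_diagrams helpers used below are defined earlier in the notebook. As a rough, self-contained sketch of the silhouette sweep they perform (using synthetic blobs in place of our PCA representation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two well-separated synthetic blobs standing in for the PCA coordinates
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(5, 1, (100, 10))])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 2: the mean silhouette peaks at the true number of blobs
```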
kmeans_optimization(adata_ss_mcf7, visual="pca_95")
silhouette_diagrams(adata_ss_mcf7, k_range=range(2, 7), dataset_name="Smart-seq MCF7", visual="pca_95")
{2: np.float32(0.49936017),
3: np.float32(0.4744652),
4: np.float32(0.46488273),
5: np.float32(0.4413419),
6: np.float32(0.42580906)}
plot_kmeans_clusters(
adata_ss_mcf7,
k = 2,
embed='umap',
rep_key='X_pca_95',
size=30,
dataset_name="Smart-seq MCF7"
)
print("=========================")
plot_kmeans_clusters(
adata_ss_mcf7,
k = 2,
embed='tsne',
rep_key='X_pca_95',
size=30,
dataset_name="Smart-seq MCF7"
)
print("=========================")
plot_kmeans_clusters(
adata_ss_mcf7,
k = 2,
embed='pca',
rep_key='X_pca_95',
pca_dims=(0, 5),
size=30,
dataset_name="Smart-seq MCF7"
)
=========================
=========================
evaluate_clustering(
adata_ss_mcf7.obs['condition'],
adata_ss_mcf7.obs['kmeans_k2'],
method_name="Smart-seq MCF7 KMeans (k=2)"
)
print("\n")
=== Smart-seq MCF7 KMeans (k=2) Evaluation ===
ARI: 0.8907 NMI: 0.8430
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 117 7
Norm 0 126
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 94.4% 5.6%
Norm 0.0% 100.0%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 100.0% 5.3%
Norm 0.0% 94.7%
Purity per cluster:
Purity
Cluster
0 1.000000
1 0.947368
Overall purity: 0.9720
HCC Smart Seq¶
Although the mean silhouette score reaches its maximum at k = 2, the detailed silhouette plots for k = 3–6 show more uniformly high widths across all clusters. This suggests that beyond a simple hypoxic vs. normoxic dichotomy, the HCC data may harbor three or more distinct subgroups, potentially corresponding to different hypoxia responses or other biological states within each condition.
The evaluation of these clusters is very poor for k = 2, with ARI: -0.0042 and NMI: 0.0000. The results improve for k = 3 (ARI: 0.5161, NMI: 0.4844), but the third cluster appears to capture cells lying on the border between the two conditions. Biologically, it may represent cells in a transitional state that have not yet fully developed the hypoxic response.
kmeans_optimization(adata_ss_hcc, visual="pca_95")
silhouette_diagrams(adata_ss_hcc, k_range=range(2, 8), dataset_name="Smart-seq HCC1806", visual="pca_95")
{2: np.float32(0.2634671),
3: np.float32(0.16170451),
4: np.float32(0.16963938),
5: np.float32(0.18061545),
6: np.float32(0.190826),
7: np.float32(0.17456625)}
plot_kmeans_clusters(
adata_ss_hcc,
rep_key='X_pca_95',
k = 2,
embed='tsne',
size=50,
dataset_name="Smart-seq HCC1806"
)
print("==========================")
plot_kmeans_clusters(
adata_ss_hcc,
rep_key='X_pca_95',
k = 2,
embed='pca',
pca_dims=(1, 2),
size=30,
dataset_name="Smart-seq HCC1806"
)
==========================
evaluate_clustering(
adata_ss_hcc.obs['condition'],
adata_ss_hcc.obs['kmeans_k2'],
method_name="KMeans Clustering (HCC) k=2"
)
print("\n")
=== KMeans Clustering (HCC) k=2 Evaluation ===
ARI: -0.0042 NMI: 0.0000
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 28 69
Norm 25 60
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 28.9% 71.1%
Norm 29.4% 70.6%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 52.8% 53.5%
Norm 47.2% 46.5%
Purity per cluster:
Purity
Cluster
0 0.528302
1 0.534884
Overall purity: 0.7088
plot_kmeans_clusters(
adata_ss_hcc,
rep_key='X_pca_95',
k = 3,
embed='umap',
size=50,
dataset_name="Smart-seq HCC1806"
)
print("==========================")
plot_kmeans_clusters(
adata_ss_hcc,
rep_key='X_pca_95',
k = 3,
embed='pca',
pca_dims=(1, 2),
size=30,
dataset_name="Smart-seq HCC1806"
)
==========================
evaluate_clustering(
adata_ss_hcc.obs['condition'],
adata_ss_hcc.obs['kmeans_k3'],
method_name="KMeans Clustering (HCC) k=3"
)
print("\n")
=== KMeans Clustering (HCC) k=3 Evaluation ===
ARI: 0.5161 NMI: 0.4844
Contingency Table (raw counts):
Cluster 0 1 2
True
Hypo 18 7 72
Norm 19 66 0
Row-wise percentages (each true class → clusters):
Cluster 0 1 2
True
Hypo 18.6% 7.2% 74.2%
Norm 22.4% 77.6% 0.0%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1 2
True
Hypo 48.6% 9.6% 100.0%
Norm 51.4% 90.4% 0.0%
Purity per cluster:
Purity
Cluster
0 0.513514
1 0.904110
2 1.000000
Overall purity: 0.7582
MCF7 Drop Seq¶
The optimal number of clusters, based on silhouette score optimization, is k = 3. However, the resulting clusters do not align well with the biological conditions; the evaluation metrics are ARI: 0.3221 and NMI: 0.3356. Inspecting the contingency tables reveals that cluster 1 exhibits the most overlap, while the other clusters achieve better purity. The PCA plot supports this observation, highlighting the discrepancies. Again, this could be caused by some cells being in a transitional state between the two conditions.
kmeans_optimization(adata_ds_mcf7_scaled, rep_key="X_pca")
silhouette_diagrams(adata_ds_mcf7_scaled, k_range=range(2, 7), dataset_name="Drop-seq MCF7", rep_key="X_pca")
{2: np.float32(0.24377671),
3: np.float32(0.26212752),
4: np.float32(0.24349803),
5: np.float32(0.18707642),
6: np.float32(0.18642579)}
# scaled version
plot_kmeans_clusters(
adata_ds_mcf7_scaled,
rep_key='X_pca',
embed='tsne',
size=10,
k=3,
dataset_name = "Drop-seq MCF7"
)
print("==========================")
plot_kmeans_clusters(
adata_ds_mcf7_scaled,
embed='umap',
size=10,
rep_key='X_pca',
k=3,
dataset_name="Drop-seq MCF7"
)
print("==========================")
plot_kmeans_clusters(
adata_ds_mcf7_scaled,
embed='pca',
rep_key='X_pca',
pca_dims=(0,2),
size=10,
k=3,
dataset_name="Drop-seq MCF7"
)
==========================
==========================
evaluate_clustering(
adata_ds_mcf7_scaled.obs['condition'],
adata_ds_mcf7_scaled.obs['kmeans_k3'],
method_name="Drop-seq MCF7 KMeans (k=3)"
)
print("\n")
=== Drop-seq MCF7 KMeans (k=3) Evaluation ===
ARI: 0.3221 NMI: 0.3356
Contingency Table (raw counts):
Cluster 0 1 2
True
Hypo 5753 2603 565
Norm 160 9065 3480
Row-wise percentages (each true class → clusters):
Cluster 0 1 2
True
Hypo 64.5% 29.2% 6.3%
Norm 1.3% 71.3% 27.4%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1 2
True
Hypo 97.3% 22.3% 14.0%
Norm 2.7% 77.7% 86.0%
Purity per cluster:
Purity
Cluster
0 0.972941
1 0.776911
2 0.860321
Overall purity: 0.6852
HCC Drop Seq¶
The silhouette score peaks at k=2, and the silhouette diagram for k=2 indicates well-separated and balanced clusters. However, the resulting clusters do not correspond to the expected biological conditions when visualized across different spaces.
kmeans_optimization(adata_ds_hcc_scaled, rep_key="X_pca")
silhouette_diagrams(adata_ds_hcc_scaled, k_range=range(2, 8), dataset_name="Drop-seq HCC1806 scaled", rep_key="X_pca")
{2: np.float32(0.2080853),
3: np.float32(0.17907955),
4: np.float32(0.17723405),
5: np.float32(0.1760093),
6: np.float32(0.17995057),
7: np.float32(0.16560963)}
#scaled version
plot_kmeans_clusters(
adata_ds_hcc_scaled,
rep_key='X_pca',
embed='umap',
size=10,
k=2,
dataset_name="Drop-seq HCC1806"
)
plot_kmeans_clusters(
adata_ds_hcc_scaled,
embed='tsne',
size=10,
rep_key='X_pca',
k=2,
dataset_name="Drop-seq HCC1806"
)
plot_kmeans_clusters(
adata_ds_hcc_scaled,
embed='pca',
rep_key='X_pca',
pca_dims=(2,3),
size=10,
k=2,
dataset_name="Drop-seq HCC1806"
)
Hierarchical Clustering¶
Hierarchical clustering is an alternative to K-Means that builds a tree-like structure of nested clusters.
There are two main types of hierarchical clustering:
- Agglomerative Clustering: A "bottom-up" approach where each data point starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive Clustering: A "top-down" approach where all data points start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Agglomerative clustering is more commonly used and is implemented in libraries like scikit-learn. It allows for different linkage criteria, such as:
- Single Linkage: Minimum distance between points in two clusters.
- Complete Linkage: Maximum distance between points in two clusters.
- Average Linkage: Average distance between all points in two clusters.
- Ward's Linkage: Minimizes the variance within clusters.
For the purpose of our analysis, we will use only agglomerative clustering, using Ward's linkage. This method is suitable for our dataset as it minimizes the variance within clusters, ensuring that the resulting clusters are compact and well-separated.
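Before the helpers, a minimal self-contained sketch of Ward-linkage agglomerative clustering with scikit-learn on synthetic data (the blob locations and sizes are illustrative assumptions, not our datasets):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Two synthetic groups standing in for the two oxygen conditions in PCA space
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(6, 1, (50, 5))])
y_true = np.array([0] * 50 + [1] * 50)

# Ward's linkage merges the pair of clusters that least increases within-cluster variance
agglo = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agglo.fit_predict(X)

print(adjusted_rand_score(y_true, labels))  # 1.0 for well-separated blobs
```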
Helper Functions¶
# ------------------------------------------
# Step 1: Use PCA-reduced data from Scanpy
# ------------------------------------------
def plot_dendrogram(adata, use_rep='X_pca_95', title="title"):
# Ensure the requested representation (e.g. PCA) has been computed
if use_rep not in adata.obsm:
raise KeyError(f"adata.obsm['{use_rep}'] not found; run PCA first")
X = adata.obsm[use_rep]
# ------------------------------------------
# Step 2: Plot dendrogram to visualize hierarchy
# ------------------------------------------
linked = linkage(X, method='ward')
plt.figure(figsize=(12, 6))
dendrogram(
linked,
orientation='top',
distance_sort='descending',
show_leaf_counts=False,
truncate_mode='level',
p=30
)
plt.title(f'Hierarchical Clustering Dendrogram for {title}')
plt.xlabel('Cells (truncated)')
plt.ylabel('Distance')
plt.grid(True)
plt.tight_layout()
plt.show()
# ------------------------------------------
# Step 3: Run Agglomerative Clustering
# ------------------------------------------
from sklearn.cluster import AgglomerativeClustering

def run_agglo(adata, cut, components="1,2", use_rep='X_pca_95', title=""):
# Ensure the requested representation (e.g. PCA) has been computed
if use_rep not in adata.obsm:
raise KeyError(f"adata.obsm['{use_rep}'] not found; run PCA first")
X = adata.obsm[use_rep]
# Perform Agglomerative Clustering
agglo = AgglomerativeClustering(n_clusters=cut, linkage='ward')
hc_labels = agglo.fit_predict(X)
adata.obs['hc_clusters'] = hc_labels.astype(str)
# ------------------------------------------
# Step 4: Plot PCA and UMAP colored by cluster vs. condition
# ------------------------------------------
fig, axes = plt.subplots(2, 2, figsize=(16, 14))
# Top row: PCA
sc.pl.pca(
adata,
color='hc_clusters',
ax=axes[0, 0],
show=False,
size=50,
components=components
)
axes[0, 0].set_title(f"PCA: Agglomerative Clusters for {title}, {cut} clusters")
sc.pl.pca(
adata,
color='condition',
ax=axes[0, 1],
show=False,
size=50,
components=components
)
axes[0, 1].set_title(f"PCA: Original Condition for {title}, {cut} clusters")
# Bottom row: UMAP
sc.pl.umap(
adata,
color='hc_clusters',
ax=axes[1, 0],
show=False,
size=50
)
axes[1, 0].set_title(f'UMAP: Agglomerative Clusters for {title}, {cut} clusters')
sc.pl.umap(
adata,
color='condition',
ax=axes[1, 1],
show=False,
size=50
)
axes[1, 1].set_title(f'UMAP: Original Condition for {title}, {cut} clusters')
plt.tight_layout()
plt.show()
MCF7 Smart Seq¶
Using the agglomerative clustering technique, the clusters identified for this dataset are highly consistent. The dendrogram reveals a clear separation with two distinct branches, which correspond accurately to the biological conditions. The overall purity achieved is 0.9840, surpassing the results obtained with K-Means clustering.
plot_dendrogram(adata_ss_mcf7, use_rep='X_pca_95', title="MCF7 SmartSeq")
run_agglo(adata_ss_mcf7, cut=2, components="1,2", use_rep='X_pca_95', title="MCF7 SmartSeq")
evaluate_clustering(
adata_ss_mcf7.obs['condition'],
adata_ss_mcf7.obs['hc_clusters'],
method_name="Smart-seq MCF7 Agglomerative Clustering (k=2)"
)
print("\n")
=== Smart-seq MCF7 Agglomerative Clustering (k=2) Evaluation ===
ARI: 0.9368 NMI: 0.8974
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 124 0
Norm 4 122
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 100.0% 0.0%
Norm 3.2% 96.8%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 96.9% 0.0%
Norm 3.1% 100.0%
Purity per cluster:
Purity
Cluster
0 0.96875
1 1.00000
Overall purity: 0.9840
HCC Smart Seq¶
Consistent with the findings from K-Means clustering, it is challenging to split the HCC SmartSeq dataset into two distinct clusters. In this cell, we plot the dendrogram and observe that the largest jump suggests two clusters. However, one could also argue that six clusters might be a reasonable choice, so we inspect both scenarios.
For two clusters, the results are suboptimal, with low ARI and NMI scores. When using six clusters, both ARI and NMI scores improve (ARI: 0.3229, NMI: 0.3799), but the values remain relatively low. This is expected, as these metrics tend to perform worse when the number of clusters exceeds the number of biological conditions. The six-cluster solution may capture additional subgroups, potentially representing cells in transitional states or other biological variations within the dataset.
plot_dendrogram(adata_ss_hcc, use_rep='X_pca_95', title="HCC1806 SmartSeq")
run_agglo(adata_ss_hcc, cut=2, components='2,3', use_rep='X_pca_95', title="HCC1806 SmartSeq")
evaluate_clustering(
adata_ss_hcc.obs['condition'],
adata_ss_hcc.obs['hc_clusters'],
method_name="Smart-seq HCC1806 Agglomerative Clustering (k=2)"
)
print("\n")
=== Smart-seq HCC1806 Agglomerative Clustering (k=2) Evaluation ===
ARI: 0.0257 NMI: 0.0201
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 27 70
Norm 37 48
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 27.8% 72.2%
Norm 43.5% 56.5%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 42.2% 59.3%
Norm 57.8% 40.7%
Purity per cluster:
Purity
Cluster
0 0.578125
1 0.593220
Overall purity: 0.6484
run_agglo(adata_ss_hcc, cut=6, components='2,3', use_rep='X_pca_95', title="HCC1806 SmartSeq")
evaluate_clustering(
adata_ss_hcc.obs['condition'],
adata_ss_hcc.obs['hc_clusters'],
method_name="Smart-seq HCC1806 Agglomerative Clustering (k=6)"
)
print("\n")
=== Smart-seq HCC1806 Agglomerative Clustering (k=6) Evaluation ===
ARI: 0.3229 NMI: 0.3799
Contingency Table (raw counts):
Cluster 0 1 2 3 4 5
True
Hypo 25 1 8 4 58 1
Norm 36 1 0 48 0 0
Row-wise percentages (each true class → clusters):
Cluster 0 1 2 3 4 5
True
Hypo 25.8% 1.0% 8.2% 4.1% 59.8% 1.0%
Norm 42.4% 1.2% 0.0% 56.5% 0.0% 0.0%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1 2 3 4 5
True
Hypo 41.0% 50.0% 100.0% 7.7% 100.0% 100.0%
Norm 59.0% 50.0% 0.0% 92.3% 0.0% 0.0%
Purity per cluster:
Purity
Cluster
0 0.590164
1 0.500000
2 1.000000
3 0.923077
4 1.000000
5 1.000000
Overall purity: 0.5824
MCF7 Drop Seq¶
Upon inspecting the dendrograms, two and three clusters emerge as reasonable choices for identifying distinct groups. For two clusters, all evaluation metrics are superior. Notably, in the UMAP plot, cluster 1 contains most of the cells labelled as normoxic, but it also contains a sizeable fraction of cells from the other condition.
Adding a third cluster does not improve the metrics; instead, both ARI and NMI decline. The additional cluster appears in a region already well covered by the initial two clusters, offering no new insight while reducing overall performance.
plot_dendrogram(adata_ds_mcf7_scaled, use_rep='X_pca', title="MCF7 DropSeq")
run_agglo(adata_ds_mcf7_scaled, cut=2, components='1,3', use_rep='X_pca', title="MCF7 DropSeq")
evaluate_clustering(
adata_ds_mcf7_scaled.obs['condition'],
adata_ds_mcf7_scaled.obs['hc_clusters'],
method_name="Drop-seq MCF7 Agglomerative Clustering (k=2)"
)
print("\n")
=== Drop-seq MCF7 Agglomerative Clustering (k=2) Evaluation ===
ARI: 0.5094 NMI: 0.4794
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 2984 5937
Norm 12616 89
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 33.4% 66.6%
Norm 99.3% 0.7%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 19.1% 98.5%
Norm 80.9% 1.5%
Purity per cluster:
Purity
Cluster
0 0.808718
1 0.985231
Overall purity: 0.8579
run_agglo(adata_ds_mcf7_scaled, cut=3, components='1,3', use_rep='X_pca' , title="MCF7 DropSeq")
evaluate_clustering(
adata_ds_mcf7_scaled.obs['condition'],
adata_ds_mcf7_scaled.obs['hc_clusters'],
method_name="Drop-seq MCF7 Agglomerative Clustering (k=3)"
)
print("\n")
=== Drop-seq MCF7 Agglomerative Clustering (k=3) Evaluation ===
ARI: 0.3801 NMI: 0.4052
Contingency Table (raw counts):
Cluster 0 1 2
True
Hypo 2906 5937 78
Norm 10479 89 2137
Row-wise percentages (each true class → clusters):
Cluster 0 1 2
True
Hypo 32.6% 66.6% 0.9%
Norm 82.5% 0.7% 16.8%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1 2
True
Hypo 21.7% 98.5% 3.5%
Norm 78.3% 1.5% 96.5%
Purity per cluster:
Purity
Cluster
0 0.782891
1 0.985231
2 0.964786
Overall purity: 0.7591
HCC Drop Seq¶
For this cell line, the dendrograms suggest reasonable cluster numbers of 2 and 4. We evaluate both scenarios. While the ARI is low, the 4-cluster solution shows interesting results, as it appears to represent subgroups predicting hypoxia or normoxia. In contrast, the 2-cluster solution lacks a clear biological interpretation in the context of our analysis.
plot_dendrogram(adata_ds_hcc_scaled, use_rep="X_pca", title="HCC1806 DropSeq")
run_agglo(adata_ds_hcc_scaled, cut=2, components="3,4", use_rep='X_pca', title="HCC1806 DropSeq")
evaluate_clustering(
adata_ds_hcc_scaled.obs['condition'],
adata_ds_hcc_scaled.obs['hc_clusters'],
method_name="Drop-seq HCC1806 Agglomerative Clustering (k=2)"
)
print("\n")
=== Drop-seq HCC1806 Agglomerative Clustering (k=2) Evaluation ===
ARI: 0.0640 NMI: 0.0364
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 6186 2713
Norm 2741 3042
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 69.5% 30.5%
Norm 47.4% 52.6%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 69.3% 47.1%
Norm 30.7% 52.9%
Purity per cluster:
Purity
Cluster
0 0.692954
1 0.528584
Overall purity: 0.6285
run_agglo(adata_ds_hcc_scaled, cut=4, components="3,4", use_rep='X_pca', title="HCC1806 DropSeq")
evaluate_clustering(
adata_ds_hcc_scaled.obs['condition'],
adata_ds_hcc_scaled.obs['hc_clusters'],
method_name="Drop-seq HCC1806 Agglomerative Clustering (k=4)"
)
print("\n")
=== Drop-seq HCC1806 Agglomerative Clustering (k=4) Evaluation ===
ARI: 0.2156 NMI: 0.2104
Contingency Table (raw counts):
Cluster 0 1 2 3
True
Hypo 998 2216 5188 497
Norm 1634 210 1107 2832
Row-wise percentages (each true class → clusters):
Cluster 0 1 2 3
True
Hypo 11.2% 24.9% 58.3% 5.6%
Norm 28.3% 3.6% 19.1% 49.0%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1 2 3
True
Hypo 37.9% 91.3% 82.4% 14.9%
Norm 62.1% 8.7% 17.6% 85.1%
Purity per cluster:
Purity
Cluster
0 0.620821
1 0.913438
2 0.824146
3 0.850706
Overall purity: 0.5462
Leiden clustering (Scanpy)¶
Leiden clustering is a community detection algorithm commonly used in single-cell analysis to identify clusters of similar cells. It is an improvement over the Louvain algorithm, offering better partition quality and robustness.
In Scanpy, the sc.tl.leiden() function performs Leiden clustering. It requires a k-NN graph, which was computed by running sc.pp.neighbors() in the KNN section.
Parameters:
resolution: Controls the granularity of the clustering; higher values produce more clusters. In our analysis, the resolution was chosen so that the number of clusters was minimal while the clusters remained approximately evenly sized.
Leiden clustering is particularly effective for identifying subpopulations in high-dimensional single-cell datasets, making it a powerful tool for exploratory data analysis.
MCF7 Smart Seq¶
As expected, the Leiden clustering results for the MCF7 Smart-seq dataset reveal a clear separation into two clusters, corresponding to the hypoxic and normoxic conditions. This holds for both the scaled and unscaled versions. The overall purity is 0.9920, the best so far for this dataset.
# Using the igraph implementation and a fixed number of iterations can be significantly faster, especially for larger datasets
sc.tl.leiden(adata_ss_mcf7, flavor="igraph", n_iterations=2, random_state=42, resolution=0.1)
sc.pl.umap(
adata_ss_mcf7,
color=['leiden', 'condition'],
show=False,
size=40,
title=["Leiden Clustering: MCF7 Smart-seq", "Condition for MCF7 Smart-seq"]
)
[<Axes: title={'center': 'Leiden Clustering: MCF7 Smart-seq'}, xlabel='UMAP1', ylabel='UMAP2'>,
<Axes: title={'center': 'Condition for MCF7 Smart-seq'}, xlabel='UMAP1', ylabel='UMAP2'>]
#plot also in the pca space
sc.pl.pca(
adata_ss_mcf7,
color=['leiden', 'condition'],
show=False,
size=40,
components="1,6",
title=["Leiden Clustering: MCF7 Smart-seq", "Condition for MCF7 Smart-seq"]
)
[<Axes: title={'center': 'Leiden Clustering: MCF7 Smart-seq'}, xlabel='PC1', ylabel='PC6'>,
<Axes: title={'center': 'Condition for MCF7 Smart-seq'}, xlabel='PC1', ylabel='PC6'>]
evaluate_clustering(
adata_ss_mcf7.obs['condition'],
adata_ss_mcf7.obs['leiden'],
method_name="Smart-seq MCF7 Leiden Clustering"
)
print("\n")
=== Smart-seq MCF7 Leiden Clustering Evaluation ===
ARI: 0.9681 NMI: 0.9407
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 2 122
Norm 126 0
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 1.6% 98.4%
Norm 100.0% 0.0%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 1.6% 100.0%
Norm 98.4% 0.0%
Purity per cluster:
Purity
Cluster
0 0.984375
1 1.000000
Overall purity: 0.9920
HCC Smart Seq¶
Leiden clustering provides better-defined clusters than K-Means and agglomerative clustering, with fewer misclassifications. However, some overlap is still observed at the cluster boundaries, potentially reflecting transitional states in cells where the hypoxic response was developing but not fully expressed. In any case, this is the best clustering result for this cell line, with an overall purity of 0.9505.
sc.tl.leiden(adata_ss_hcc, flavor="igraph", n_iterations=2, random_state=42, resolution=0.3)
sc.pl.umap(
adata_ss_hcc,
color=['leiden', 'condition'],
show=False,
size=40,
title=["Leiden Clustering: HCC1806 Smart-seq", "Condition for HCC1806 Smart-seq"]
)
[<Axes: title={'center': 'Leiden Clustering: HCC1806 Smart-seq'}, xlabel='UMAP1', ylabel='UMAP2'>,
<Axes: title={'center': 'Condition for HCC1806 Smart-seq'}, xlabel='UMAP1', ylabel='UMAP2'>]
#plot also in the pca space
sc.pl.pca(
adata_ss_hcc,
color=['leiden', 'condition'],
show=False,
size=40,
components="2,3",
title=["Leiden Clustering: HCC1806 Smart-seq", "Condition for HCC1806 Smart-seq"]
)
[<Axes: title={'center': 'Leiden Clustering: HCC1806 Smart-seq'}, xlabel='PC2', ylabel='PC3'>,
<Axes: title={'center': 'Condition for HCC1806 Smart-seq'}, xlabel='PC2', ylabel='PC3'>]
evaluate_clustering(
adata_ss_hcc.obs['condition'],
adata_ss_hcc.obs['leiden'],
method_name="Smart-seq HCC1806"
)
print("\n")
=== Smart-seq HCC1806 Evaluation ===
ARI: 0.8109 NMI: 0.7390
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 8 89
Norm 84 1
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 8.2% 91.8%
Norm 98.8% 1.2%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 8.7% 98.9%
Norm 91.3% 1.1%
Purity per cluster:
Purity
Cluster
0 0.913043
1 0.988889
Overall purity: 0.9505
MCF7 Drop Seq¶
Using a lower resolution parameter (0.1), we observe that the Leiden clusters align almost perfectly with the conditions, reaching an overall purity of 0.9767, the best for this dataset so far.
# scaled version
sc.tl.leiden(adata_ds_mcf7_scaled, flavor="igraph", n_iterations=2, random_state=42, resolution=0.1)
sc.pl.umap(
adata_ds_mcf7_scaled,
color=['leiden', 'condition'],
show=False,
size=6,
title=["Leiden Clustering: MCF7 Drop-seq scaled", "Condition for MCF7 Drop-seq scaled"]
)
[<Axes: title={'center': 'Leiden Clustering: MCF7 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>,
<Axes: title={'center': 'Condition for MCF7 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>]
#plot also in the pca space
sc.pl.pca(
adata_ds_mcf7_scaled,
color=['leiden', 'condition'],
show=False,
size=20,
components="1,3",
title=["Leiden Clustering: MCF7 Drop-seq scaled", "Condition for MCF7 Drop-seq scaled"]
)
[<Axes: title={'center': 'Leiden Clustering: MCF7 Drop-seq scaled'}, xlabel='PC1', ylabel='PC3'>,
<Axes: title={'center': 'Condition for MCF7 Drop-seq scaled'}, xlabel='PC1', ylabel='PC3'>]
evaluate_clustering(
adata_ds_mcf7_scaled.obs['condition'],
adata_ds_mcf7_scaled.obs['leiden'],
method_name="Drop-seq MCF7 Leiden Clustering"
)
print("\n")
=== Drop-seq MCF7 Leiden Clustering Evaluation ===
ARI: 0.9090 NMI: 0.8449
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 417 8504
Norm 12619 86
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 4.7% 95.3%
Norm 99.3% 0.7%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 3.2% 99.0%
Norm 96.8% 1.0%
Purity per cluster:
Purity
Cluster
0 0.968012
1 0.989988
Overall purity: 0.9767
HCC Drop Seq¶
Also in this case, using a lower resolution (0.16) leads to almost perfect clusters, with an overall purity of 0.9409.
sc.tl.leiden(adata_ds_hcc_scaled, flavor="igraph", n_iterations=2, random_state=42, resolution=0.16)
sc.pl.umap(
adata_ds_hcc_scaled,
color=['leiden', 'condition'],
show=False,
size=20,
title=["Leiden Clustering: HCC1806 Drop-seq scaled", "Condition for HCC1806 Drop-seq scaled"]
)
[<Axes: title={'center': 'Leiden Clustering: HCC1806 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>,
<Axes: title={'center': 'Condition for HCC1806 Drop-seq scaled'}, xlabel='UMAP1', ylabel='UMAP2'>]
#plot also in the pca space
sc.pl.pca(
adata_ds_hcc_scaled,
color=['leiden', 'condition'],
show=False,
size=20,
components="3,4",
title=["Leiden Clustering: HCC Drop-seq scaled", "Condition for HCC Drop-seq scaled"]
)
[<Axes: title={'center': 'Leiden Clustering: HCC Drop-seq scaled'}, xlabel='PC3', ylabel='PC4'>,
<Axes: title={'center': 'Condition for HCC Drop-seq scaled'}, xlabel='PC3', ylabel='PC4'>]
evaluate_clustering(
adata_ds_hcc_scaled.obs['condition'],
adata_ds_hcc_scaled.obs['leiden'],
method_name="Drop-seq HCC1806 Leiden Clustering"
)
print("\n")
=== Drop-seq HCC1806 Leiden Clustering Evaluation ===
ARI: 0.7775 NMI: 0.6787
Contingency Table (raw counts):
Cluster 0 1
True
Hypo 634 8265
Norm 5550 233
Row-wise percentages (each true class → clusters):
Cluster 0 1
True
Hypo 7.1% 92.9%
Norm 96.0% 4.0%
Column-wise percentages (each cluster ← true classes):
Cluster 0 1
True
Hypo 10.3% 97.3%
Norm 89.7% 2.7%
Purity per cluster:
Purity
Cluster
0 0.897477
1 0.972582
Overall purity: 0.9409
Conclusions¶
Clustering Performance:
- The Leiden clustering algorithm consistently outperformed K-Means and Agglomerative clustering across all datasets, achieving the highest overall purity scores.
- MCF7 Smart Seq: Leiden clustering achieved an overall purity of 0.9920, the best result for this dataset.
- HCC Smart Seq: Leiden clustering achieved an overall purity of 0.9505, significantly better than other methods.
- MCF7 Drop Seq: Leiden clustering achieved an overall purity of 0.9767, outperforming other clustering techniques.
- HCC Drop Seq: Leiden clustering achieved an overall purity of 0.9409, demonstrating its robustness.
Biological Insights:
- For MCF7 Smart Seq, the clusters identified by all the techniques aligned almost perfectly with the biological conditions (hypoxia vs. normoxia), indicating a clear separation between the two states.
- For HCC Smart Seq, while Leiden clustering provided better-defined clusters, some overlap at the boundaries suggests the presence of transitional states in cells.
- For MCF7 Drop Seq, the clusters identified by Leiden clustering were highly consistent with the biological conditions, highlighting its effectiveness in scaled datasets.
- For HCC Drop Seq, the clusters revealed by Leiden clustering were well-separated, but some subgroups found by Agglomerative clustering may represent additional biological variation.
Limitations:
- K-Means and Agglomerative clustering struggled to capture the underlying structure of the data, often resulting in lower ARI and NMI scores compared to Leiden clustering.
- The ARI and NMI metrics tend to perform worse in scenarios where there are more clusters than conditions.
Recommendations:
- For future analyses, Leiden clustering is recommended as the primary clustering method due to its superior performance and ability to handle complex single-cell datasets.
- Further investigation into transitional states and subgroups in datasets could provide deeper biological insights.
- Parameter tuning, such as adjusting the resolution in Leiden clustering, can further optimize clustering results for specific datasets.
Supervised Learning: Hypoxia vs Normoxia¶
The search for a classifier involves logistic regression, SVM, random forest, and multilayer perceptron models. This diversity of models allows for a more robust final classifier.
These individual models are initially trained on PCA-transformed data, since the full datasets are very high-dimensional and require significant processing power. Feature selection is then performed both on the genes in the original (non-PCA-transformed) data and on the PCA-transformed data, to identify the top principal components.
Finally, a simple ensemble model takes the majority vote of the models trained on the selected genes for each dataset. A larger, generalized ensemble model then combines these four simple ensembles, one per dataset, and takes the majority vote of their predictions.
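The majority-vote idea can be sketched with scikit-learn's VotingClassifier (hard voting). The models and data here are placeholders, not the notebook's actual selected-gene features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a PCA-reduced expression matrix with binary labels
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    voting="hard",  # each model casts one vote; the majority label wins
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
print(f"Ensemble test accuracy: {acc:.3f}")
```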
Preparation¶
Data¶
Earlier we extracted the PCA-transformed features using the number of components required to explain 95% of the variance in each dataset. We define X_pca_ss_mcf7 and y_pca_ss_mcf7, where X_pca_ss_mcf7 contains the reduced feature representation of each cell (principal components) and y_pca_ss_mcf7 contains the corresponding condition labels ('Hypo' or 'Norm') for each cell. These will be used as input features and target labels, respectively, for training supervised classification models.
X_pca_ss_mcf7 = adata_ss_mcf7.obsm['X_pca_95']
X_pca_ss_hcc = adata_ss_hcc.obsm['X_pca_95']
X_pca_ds_mcf7 = adata_ds_mcf7.obsm['X_pca_95']
X_pca_ds_hcc = adata_ds_hcc.obsm['X_pca_95']
y_pca_ss_mcf7 = adata_ss_mcf7.obs['condition'].values
y_pca_ss_hcc = adata_ss_hcc.obs['condition'].values
y_pca_ds_mcf7 = adata_ds_mcf7.obs['condition'].values
y_pca_ds_hcc = adata_ds_hcc.obs['condition'].values
print(
    f"SmartSeq MCF7 X_pca shape: {X_pca_ss_mcf7.shape}, y_pca shape: {y_pca_ss_mcf7.shape}",
    f"SmartSeq HCC1806 X_pca shape: {X_pca_ss_hcc.shape}, y_pca shape: {y_pca_ss_hcc.shape}",
    f"DropSeq MCF7 X_pca shape: {X_pca_ds_mcf7.shape}, y_pca shape: {y_pca_ds_mcf7.shape}",
    f"DropSeq HCC1806 X_pca shape: {X_pca_ds_hcc.shape}, y_pca shape: {y_pca_ds_hcc.shape}",
    sep = "\n"
)
SmartSeq MCF7 X_pca shape: (250, 20), y_pca shape: (250,)
SmartSeq HCC1806 X_pca shape: (182, 34), y_pca shape: (182,)
DropSeq MCF7 X_pca shape: (21626, 761), y_pca shape: (21626,)
DropSeq HCC1806 X_pca shape: (14682, 844), y_pca shape: (14682,)
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_pca_ss_mcf7_encoded = encoder.fit_transform(y_pca_ss_mcf7)
y_pca_ss_hcc_encoded = encoder.fit_transform(y_pca_ss_hcc)
y_pca_ds_mcf7_encoded = encoder.fit_transform(y_pca_ds_mcf7)
y_pca_ds_hcc_encoded = encoder.fit_transform(y_pca_ds_hcc)
print("Label classes:", encoder.classes_)
print("Internal encoding:", encoder.transform(encoder.classes_))
Label classes: ['Hypo' 'Norm']
Internal encoding: [0 1]
In our case:
- 0 = 'Hypo'
- 1 = 'Norm'
print("SmartSeq MCF7:", np.unique(y_pca_ss_mcf7, return_counts = True))
print("SmartSeq HCC:", np.unique(y_pca_ss_hcc, return_counts = True))
print("DropSeq MCF7:", np.unique(y_pca_ds_mcf7, return_counts = True))
print("DropSeq HCC:", np.unique(y_pca_ds_hcc, return_counts = True))
SmartSeq MCF7: (array(['Hypo', 'Norm'], dtype=object), array([124, 126]))
SmartSeq HCC: (array(['Hypo', 'Norm'], dtype=object), array([97, 85]))
DropSeq MCF7: (array(['Hypo', 'Norm'], dtype=object), array([ 8921, 12705]))
DropSeq HCC: (array(['Hypo', 'Norm'], dtype=object), array([8899, 5783]))
Cross-validation functions¶
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def summarize_crossvalidation(search: GridSearchCV | RandomizedSearchCV):
    """Summarize model data for cross-validation."""
    best_model = search.best_estimator_
    print("Best Parameters:", search.best_params_)
    print("Best Score (CV avg):", search.best_score_)
    attributes = {
        # Logistic regression and linear SVM
        "C": "C",
        "penalty": "Penalty",
        "intercept_": "Intercept",
        "max_iter": "Max Iterations",
        "n_iter_": "Number of iterations for convergence",
        # Random forest
        "n_estimators": "Number of decision trees",
        "max_depth": "Maximum tree depth",
        "min_samples_split": "Minimum samples to split",
        "min_samples_leaf": "Minimum samples per leaf",
        "max_features": "Maximum features considered at each split",
        "bootstrap": "Bootstrap",
        "feature_importances_": "Feature importances",
    }
    for attribute, name in attributes.items():
        if hasattr(best_model, attribute):
            print(f"{name}:", getattr(best_model, attribute))
Plotting functions¶
Plotting learning curves over the searched hyperparameters helps narrow down parameter ranges and avoid excessive computation.
def plot_learning_curve(
    search: GridSearchCV | RandomizedSearchCV,
    param_names: str | list[str],
    plot_title: str = "Learning Curve",
    scoring_label: str | None = None,
    log_scale_params: list[str] | None = None
):
    if not isinstance(param_names, list):
        param_names = [param_names]
    if log_scale_params is None:
        log_scale_params = []
    results = search.cv_results_
    n_params = len(param_names)
    # Adjust figure size depending on number of subplots
    fig, axes = plt.subplots(n_params, 1, figsize = (10, 4 * n_params), squeeze = False)
    if scoring_label is None:
        scoring_label = search.scoring if isinstance(search.scoring, str) else "score"
    for i, param in enumerate(param_names):
        raw_values = [params[param] for params in results["params"]]
        # Detect type: numeric or not
        if all(isinstance(val, (int, float)) for val in raw_values):
            param_range = np.array(raw_values)
            unique_param_range = np.unique(param_range)
            is_numeric = True
        else:
            param_range = [str(val) for val in raw_values]
            unique_param_range = sorted(set(param_range))
            is_numeric = False
        train_scores = []
        val_scores = []
        std_scores = []
        for value in unique_param_range:
            if is_numeric:
                mask = param_range == value
            else:
                mask = [v == value for v in param_range]
            train_scores.append(np.mean(np.array(results["mean_train_score"])[mask]))
            val_scores.append(np.mean(np.array(results["mean_test_score"])[mask]))
            std_scores.append(np.mean(np.array(results["std_test_score"])[mask]))
        axis = axes[i, 0]
        x_values = unique_param_range if is_numeric else range(len(unique_param_range))
        # Plot training scores on left y-axis
        axis.plot(x_values, train_scores, label = "Training score", marker = "o", color = "tab:blue")
        axis.set_ylabel(f"Train {scoring_label}", color = "tab:blue")
        axis.tick_params(axis = "y", labelcolor = "tab:blue")
        # Plot validation scores on right y-axis
        axis2 = axis.twinx()
        axis2.plot(x_values, val_scores, label = "Validation score", marker = "s", color = "tab:orange")
        axis2.fill_between(
            x_values,
            np.array(val_scores) - np.array(std_scores),
            np.array(val_scores) + np.array(std_scores),
            alpha = 0.2,
            color = "tab:orange"
        )
        axis2.set_ylabel(f"Validation {scoring_label}", color = "tab:orange")
        axis2.tick_params(axis = "y", labelcolor = "tab:orange")
        axis.set_title(f"{plot_title} ({param})")
        axis.set_xlabel(param)
        axis.grid(True)
        if param in log_scale_params:
            axis.set_xscale("log")
            axis2.set_xscale("log")
        if not is_numeric:
            axis.set_xticks(x_values)
            axis.set_xticklabels(unique_param_range, rotation = 45)
    plt.tight_layout()
    plt.show()
Test function¶
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def test_model(model, X_test, y_test, verbose: bool = True):
    if verbose:
        print("========================= Testing =========================")
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    if verbose:
        cm = confusion_matrix(y_test, predictions)
        labels = ['Hypo', 'Norm']
        cm_df = pd.DataFrame(cm, index = [f"Actual {l}" for l in labels], columns = [f"Predicted {l}" for l in labels])
        print("Confusion matrix:")
        print(cm_df)
        print("Accuracy:", accuracy)
        print("Classification report:\n", classification_report(y_test, predictions))
    return accuracy
Custom classifier class¶
This wrapper class stores the specific train-test split alongside the model, so that ensembling can later be done without reusing training data.
from sklearn.base import BaseEstimator

class TrainedModelWrapper:
    def __init__(
        self,
        model: BaseEstimator,
        X,
        y,
        X_train,
        y_train,
        X_test,
        y_test,
        accuracy: float
    ):
        self.model = model
        self.X = X
        self.y = y
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.accuracy = accuracy

    def predict(self, X):
        return self.model.predict(X)

    def score(self, X, y):
        return self.model.score(X, y)

    def summary(self, verbose = True):
        if verbose:
            print("Model:", type(self.model).__name__)
            print("Training set size:", self.X_train.shape)
            print("Test set size:", self.X_test.shape)
            print("Accuracy on test set:", self.accuracy)
Logistic Regression¶
Logistic regression provides a simple, interpretable linear model that is well suited to binary classification; its coefficients give insight into the importance of the features.
Key hyperparameters¶
- penalty: L2 regularization is chosen to prevent overfitting and prioritize accuracy and stability.
- C: regularization strength that controls the trade-off between fitting the data well and regularizing the coefficients.
- solver: for larger data sets, the SAG (stochastic average gradient) solver is used to approximate gradient descent in a way that scales better with large data.
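The role of C can be seen on synthetic data (illustrative sketch, not the project data): with L2 regularization, a smaller C shrinks the coefficients harder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only
X, y = make_classification(n_samples = 200, n_features = 20, random_state = 0)

# Fit the same model with strong (C = 0.01) and weak (C = 100) regularization
small_c = LogisticRegression(penalty = "l2", C = 0.01, max_iter = 5000).fit(X, y)
large_c = LogisticRegression(penalty = "l2", C = 100, max_iter = 5000).fit(X, y)

print("||coef|| at C=0.01:", np.linalg.norm(small_c.coef_))
print("||coef|| at C=100: ", np.linalg.norm(large_c.coef_))  # larger norm: less shrinkage
```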
Training¶
Grid/randomized search cross-validation is an important step of training to identify the optimal hyperparameters.
from sklearn.linear_model import LogisticRegression

def train_logistic_regression(
    X_train,
    y_train,
    random_state: int | None = None,
    n_jobs: int | None = None,
    verbose: bool = True
):
    if verbose:
        print("========================= Training =========================")
    n_samples = X_train.shape[0]
    params = {
        "penalty": ["l2"],
        "C": [0.01, 0.1, 1, 10],
    } if n_samples < 10_000 else {
        "penalty": ["l2"],
        "C": [0.01, 0.1, 1, 10],
        "solver": ["sag"]  # More efficient on larger data sets
    }
    model = GridSearchCV(
        estimator = LogisticRegression(max_iter = 20_000, random_state = random_state, n_jobs = n_jobs),
        param_grid = params,
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    ) if n_samples < 10_000 else RandomizedSearchCV(
        estimator = LogisticRegression(max_iter = 20_000, random_state = random_state, n_jobs = n_jobs),
        param_distributions = params,
        random_state = random_state,
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    )
    model.fit(X_train, y_train)
    if verbose:
        summarize_crossvalidation(model)
        print("Training accuracy:", model.score(X_train, y_train))
    return model.best_estimator_
Evaluation¶
The train-test split is stratified to ensure the split is representative of the labels.
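A quick check of what stratify = y buys us (toy labels, not the project data): class proportions in the train and test splits closely match the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 30% class 0, 70% class 1
y = np.array([0] * 30 + [1] * 70)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 10, stratify = y
)
print("Full data class balance:", y.mean())
print("Train class balance:    ", y_train.mean())
print("Test class balance:     ", y_test.mean())
```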
from sklearn.model_selection import train_test_split

def train_test_logistic_regression(
    X,
    y,
    test_size: float = 0.25,
    train_size: float | None = None,
    random_state: int = 10,
    n_jobs: int | None = None,
    verbose: bool = True
):
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y
    )
    if verbose:
        print("Training data dimensions:", X_train.shape)
        print("Testing data dimensions:", X_test.shape)
    # Train the model
    model = train_logistic_regression(X_train = X_train, y_train = y_train, random_state = random_state, n_jobs = n_jobs, verbose = verbose)
    # Evaluate the model
    accuracy = test_model(model = model, X_test = X_test, y_test = y_test, verbose = verbose)
    return TrainedModelWrapper(
        model = model,
        X = X,
        y = y,
        X_train = X_train,
        y_train = y_train,
        X_test = X_test,
        y_test = y_test,
        accuracy = accuracy
    )
ss_mcf7_pca_logit = train_test_logistic_regression(X_pca_ss_mcf7, y_pca_ss_mcf7, n_jobs = -1)
Training data dimensions: (187, 20)
Testing data dimensions: (63, 20)
========================= Training =========================
Best Parameters: {'C': 0.01, 'penalty': 'l2'}
Best Score (CV avg): 0.9891891891891891
C: 0.01
Penalty: l2
Intercept: [-8.10718121e-07]
Max Iterations: 10000
Number of iterations for convergence: [35]
Training accuracy: 1.0
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 31 0
Actual Norm 0 32
Accuracy: 1.0
Classification report:
precision recall f1-score support
Hypo 1.00 1.00 1.00 31
Norm 1.00 1.00 1.00 32
accuracy 1.00 63
macro avg 1.00 1.00 1.00 63
weighted avg 1.00 1.00 1.00 63
ss_hcc_pca_logit = train_test_logistic_regression(X_pca_ss_hcc, y_pca_ss_hcc, n_jobs = -1)
Training data dimensions: (136, 34)
Testing data dimensions: (46, 34)
========================= Training =========================
Best Parameters: {'C': 0.01, 'penalty': 'l2'}
Best Score (CV avg): 0.978042328042328
C: 0.01
Penalty: l2
Intercept: [-1.46755599e-06]
Max Iterations: 10000
Number of iterations for convergence: [46]
Training accuracy: 1.0
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 24 1
Actual Norm 0 21
Accuracy: 0.9782608695652174
Classification report:
precision recall f1-score support
Hypo 1.00 0.96 0.98 25
Norm 0.95 1.00 0.98 21
accuracy 0.98 46
macro avg 0.98 0.98 0.98 46
weighted avg 0.98 0.98 0.98 46
ds_mcf7_pca_logit = train_test_logistic_regression(X_pca_ds_mcf7, y_pca_ds_mcf7, n_jobs = -1)
Training data dimensions: (16219, 761)
Testing data dimensions: (5407, 761)
========================= Training =========================
Best Parameters: {'solver': 'sag', 'penalty': 'l2', 'C': 0.1}
Best Score (CV avg): 0.9792835217881786
C: 0.1
Penalty: l2
Intercept: [1.]
Max Iterations: 20000
Number of iterations for convergence: [20000]
Training accuracy: 0.9901966829027684
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2166 64
Actual Norm 46 3131
Accuracy: 0.9796560014795636
Classification report:
precision recall f1-score support
Hypo 0.98 0.97 0.98 2230
Norm 0.98 0.99 0.98 3177
accuracy 0.98 5407
macro avg 0.98 0.98 0.98 5407
weighted avg 0.98 0.98 0.98 5407
ds_hcc_pca_logit = train_test_logistic_regression(X_pca_ds_hcc, y_pca_ds_hcc, n_jobs = -1)
Training data dimensions: (11011, 844)
Testing data dimensions: (3671, 844)
========================= Training =========================
Best Parameters: {'solver': 'sag', 'penalty': 'l2', 'C': 0.1}
Best Score (CV avg): 0.9554083833332715
C: 0.1
Penalty: l2
Intercept: [-2.1879232]
Max Iterations: 20000
Number of iterations for convergence: [132]
Training accuracy: 0.9741167922986105
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2116 109
Actual Norm 80 1366
Accuracy: 0.9485153909016617
Classification report:
precision recall f1-score support
Hypo 0.96 0.95 0.96 2225
Norm 0.93 0.94 0.94 1446
accuracy 0.95 3671
macro avg 0.94 0.95 0.95 3671
weighted avg 0.95 0.95 0.95 3671
Support Vector Machine¶
Support vector machines are effective classifiers for high-dimensional data and can use the kernel trick to learn non-linear boundaries arising from complex relationships in the data (although LinearSVC turns out to be the best classifier here).
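The kernel trick matters when classes are not linearly separable. A sketch on synthetic concentric circles (our own toy data, not the expression matrices) shows an RBF-kernel SVC succeeding where a linear SVM cannot:

```python
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC, SVC

# Two concentric circles: no linear boundary separates them
X, y = make_circles(n_samples = 400, factor = 0.3, noise = 0.05, random_state = 0)

linear_acc = LinearSVC(max_iter = 10_000).fit(X, y).score(X, y)
rbf_acc = SVC(kernel = "rbf").fit(X, y).score(X, y)
print(f"Linear SVM accuracy: {linear_acc:.2f}")  # near chance
print(f"RBF-kernel accuracy: {rbf_acc:.2f}")     # near perfect
```

On our PCA-transformed data the classes are close to linearly separable, which is why LinearSVC suffices.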
Key hyperparameters¶
- penalty: L2 regularization is used by default in LinearSVC.
- C: controls the regularization strength.
Training¶
from sklearn.svm import LinearSVC

def train_svm(
    X_train,
    y_train,
    random_state: int | None = None,
    n_jobs: int | None = None,
    verbose: bool = True
):
    if verbose:
        print("========================= Training =========================")
    params = {
        "C": [0.025, 0.05, 0.1, 1, 10, 50]
    }
    model = GridSearchCV(
        estimator = LinearSVC(random_state = random_state, max_iter = 10_000),
        param_grid = params,
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    )
    model.fit(X_train, y_train)
    if verbose:
        summarize_crossvalidation(model)
        print("Training accuracy:", model.score(X_train, y_train))
        plot_learning_curve(model, list(params.keys()))
    return model.best_estimator_
Evaluation¶
def train_test_svm(
    X,
    y,
    test_size: float = 0.25,
    train_size: float | None = None,
    random_state: int = 10,
    n_jobs: int | None = None,
    verbose: bool = True
):
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y
    )
    if verbose:
        print("Training data dimensions:", X_train.shape)
        print("Testing data dimensions:", X_test.shape)
    # Train the model
    model = train_svm(X_train = X_train, y_train = y_train, random_state = random_state, n_jobs = n_jobs, verbose = verbose)
    # Evaluate the model
    accuracy = test_model(model = model, X_test = X_test, y_test = y_test, verbose = verbose)
    return TrainedModelWrapper(
        model = model,
        X = X,
        y = y,
        X_train = X_train,
        y_train = y_train,
        X_test = X_test,
        y_test = y_test,
        accuracy = accuracy
    )
ss_mcf7_pca_svm = train_test_svm(X_pca_ss_mcf7, y_pca_ss_mcf7, n_jobs = -1)
Training data dimensions: (187, 20)
Testing data dimensions: (63, 20)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9891891891891891
C: 0.025
Penalty: l2
Intercept: [-1.68452782e-08]
Max Iterations: 10000
Number of iterations for convergence: 8
Training accuracy: 1.0
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 31 0
Actual Norm 0 32
Accuracy: 1.0
Classification report:
precision recall f1-score support
Hypo 1.00 1.00 1.00 31
Norm 1.00 1.00 1.00 32
accuracy 1.00 63
macro avg 1.00 1.00 1.00 63
weighted avg 1.00 1.00 1.00 63
ss_hcc_pca_svm = train_test_svm(X_pca_ss_hcc, y_pca_ss_hcc, n_jobs = -1)
Training data dimensions: (136, 34)
Testing data dimensions: (46, 34)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9634920634920634
C: 0.025
Penalty: l2
Intercept: [-1.02120292e-07]
Max Iterations: 10000
Number of iterations for convergence: 9
Training accuracy: 1.0
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 24 1
Actual Norm 0 21
Accuracy: 0.9782608695652174
Classification report:
precision recall f1-score support
Hypo 1.00 0.96 0.98 25
Norm 0.95 1.00 0.98 21
accuracy 0.98 46
macro avg 0.98 0.98 0.98 46
weighted avg 0.98 0.98 0.98 46
ds_mcf7_pca_svm = train_test_svm(X_pca_ds_mcf7, y_pca_ds_mcf7, n_jobs = -1)
Training data dimensions: (16219, 761)
Testing data dimensions: (5407, 761)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9773103826395694
C: 0.025
Penalty: l2
Intercept: [0.32791924]
Max Iterations: 10000
Number of iterations for convergence: 8
Training accuracy: 0.9903199950675134
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2163 67
Actual Norm 51 3126
Accuracy: 0.9781764379508046
Classification report:
precision recall f1-score support
Hypo 0.98 0.97 0.97 2230
Norm 0.98 0.98 0.98 3177
accuracy 0.98 5407
macro avg 0.98 0.98 0.98 5407
weighted avg 0.98 0.98 0.98 5407
ds_hcc_pca_svm = train_test_svm(X_pca_ds_hcc, y_pca_ds_hcc, n_jobs = -1)
Training data dimensions: (11011, 844)
Testing data dimensions: (3671, 844)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9528651995070714
C: 0.025
Penalty: l2
Intercept: [-0.82237322]
Max Iterations: 10000
Number of iterations for convergence: 8
Training accuracy: 0.9789301607483426
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2112 113
Actual Norm 82 1364
Accuracy: 0.9468809588667938
Classification report:
precision recall f1-score support
Hypo 0.96 0.95 0.96 2225
Norm 0.92 0.94 0.93 1446
accuracy 0.95 3671
macro avg 0.94 0.95 0.94 3671
weighted avg 0.95 0.95 0.95 3671
Random Forest¶
Random forest leverages the power of ensembling to provide accurate predictions while handling non-linearity and interactions between features. Its feature importance scores aid interpretability by highlighting the features that contribute to hypoxia classification. Bootstrap aggregation and random feature selection make this model very robust.
Key hyperparameters¶
- n_estimators: the number of decision trees in the forest. More trees can improve performance, given a sufficiently large data set, but increase training time.
- max_depth: the maximum depth of each tree; limits model complexity to prevent overfitting.
- min_samples_split: the minimum number of samples required to split an internal node. Higher values reduce overfitting.
- min_samples_leaf: the minimum number of samples required in a leaf node. Helps smooth the model and prevent learning from outliers.
- max_features: the number of features to consider when looking for the best split. Controls tree diversity and model variance.
- bootstrap: whether bootstrap samples are used when building trees. Introduces randomness for better generalization.
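The effect of max_depth on overfitting can be sketched on noisy synthetic data (illustrative only, not the project data): unrestricted trees memorize the training set, while a shallow forest cannot.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 20% of labels flipped, so a perfect training fit implies memorizing noise
X, y = make_classification(n_samples = 500, n_features = 20, flip_y = 0.2, random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

deep = RandomForestClassifier(max_depth = None, random_state = 0).fit(X_train, y_train)
shallow = RandomForestClassifier(max_depth = 3, random_state = 0).fit(X_train, y_train)

print("Deep:    train", deep.score(X_train, y_train), "test", deep.score(X_test, y_test))
print("Shallow: train", shallow.score(X_train, y_train), "test", shallow.score(X_test, y_test))
```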
Training¶
There are different parameter pools depending on the size of the training set, to avoid overfitting while also keeping the computational cost manageable. Grid search is used for cross-validation on smaller data sets and randomized search on larger ones, since computation time increases significantly with data set size. Plots of the learning curves for each hyperparameter help narrow down the pool of hyperparameters before running more comprehensive searches.
As the data set grows, the ensemble becomes less sensitive to individual decision trees, so certain hyperparameter pools, such as the number of trees, can be relaxed.
Initially, the confusion matrix for the model (specifically on DropSeq HCC) showed the model predicting a significant portion of normoxia as hypoxia, with a 15% difference between the training and testing scores, suggesting overfitting. The scorer for GridSearchCV and RandomizedSearchCV was changed to f1_macro to better accommodate the uneven distribution of labels in the data.
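The motivation for f1_macro over plain accuracy is easy to demonstrate (toy labels, not the project data): a degenerate classifier that always predicts the majority class still scores high accuracy but a poor macro-averaged F1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 90% majority class (1), 10% minority class (0); predict the majority everywhere
y_true = np.array([1] * 90 + [0] * 10)
y_pred = np.ones_like(y_true)

acc = accuracy_score(y_true, y_pred)
f1m = f1_score(y_true, y_pred, average = "macro", zero_division = 0)
print("Accuracy:", acc)   # 0.9, despite never finding the minority class
print("Macro F1:", f1m)   # much lower: minority-class F1 is 0
```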
from sklearn.ensemble import RandomForestClassifier

def train_random_forest(
    X_train,
    y_train,
    random_state: int | None = None,
    n_jobs: int | None = None,
    verbose: bool = True
):
    if verbose:
        print("========================= Training =========================")
    n_samples = X_train.shape[0]
    params = {
        "n_estimators": [25, 50, 100],
        "max_depth": [5, 10, 20, None],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4, 10],
        "max_features": ["sqrt", "log2", None],
        "bootstrap": [True, False]
    } if n_samples < 1_000 else {
        "n_estimators": [100, 200, 300, 400],
        "class_weight": ["balanced"],
        "max_depth": [5, 10, 20],
        "min_samples_split": [5, 10, 20],
        "min_samples_leaf": [1, 2, 5, 10, 25],
        "max_features": ["sqrt"],
        "bootstrap": [True]
    } if n_samples < 15_000 else {
        "n_estimators": [100, 200, 400, 600],
        "class_weight": ["balanced"],
        "max_depth": [10, 20, 30],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "max_features": ["sqrt"],
        "bootstrap": [True]
    }
    model = GridSearchCV(
        estimator = RandomForestClassifier(random_state = random_state, n_jobs = n_jobs),
        param_grid = params,
        scoring = "f1_macro",
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    ) if n_samples < 10_000 else RandomizedSearchCV(
        estimator = RandomForestClassifier(random_state = random_state, n_jobs = n_jobs),
        param_distributions = params,
        random_state = random_state,
        scoring = "f1_macro",
        refit = True,
        cv = 5,
        n_jobs = n_jobs,
        return_train_score = True
    )
    model.fit(X_train, y_train)
    if verbose:
        summarize_crossvalidation(model)
        print("Training accuracy:", model.score(X_train, y_train))
        plot_learning_curve(model, list(params.keys()))
    return model.best_estimator_
Evaluation¶
def train_test_random_forest(
    X,
    y,
    test_size: float = 0.25,
    train_size: float | None = None,
    random_state: int = 10,
    n_jobs: int | None = None,
    verbose: bool = True
):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y
    )
    if verbose:
        print("Training data dimensions:", X_train.shape)
        print("Testing data dimensions:", X_test.shape)
    model = train_random_forest(X_train = X_train, y_train = y_train, random_state = random_state, n_jobs = n_jobs, verbose = verbose)
    accuracy = test_model(model = model, X_test = X_test, y_test = y_test, verbose = verbose)
    return TrainedModelWrapper(
        model = model,
        X = X,
        y = y,
        X_train = X_train,
        y_train = y_train,
        X_test = X_test,
        y_test = y_test,
        accuracy = accuracy
    )
ss_mcf7_pca_random_forest = train_test_random_forest(X_pca_ss_mcf7, y_pca_ss_mcf7, n_jobs = -1)
Training data dimensions: (187, 20)
Testing data dimensions: (63, 20)
========================= Training =========================
Best Parameters: {'bootstrap': True, 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 25}
Best Score (CV avg): 0.9945945945945945
Number of decision trees: 25
Maximum tree depth: 5
Minimum samples to split: 10
Minimum samples per leaf: 4
Maximum features considered at each split: sqrt
Bootstrap: True
Feature importances: [4.87643283e-01 7.33065976e-02 6.00266784e-02 9.91685382e-02
8.78119702e-02 8.30256017e-02 4.92950249e-03 6.29815380e-04
5.22548266e-02 1.62137250e-02 8.48520910e-04 5.76192650e-04
7.17375907e-03 1.37367232e-02 6.91512970e-04 1.09635964e-02
2.10741782e-04 5.07314934e-04 2.81099111e-04 0.00000000e+00]
Training accuracy: 1.0
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 31 0
Actual Norm 0 32
Accuracy: 1.0
Classification report:
precision recall f1-score support
Hypo 1.00 1.00 1.00 31
Norm 1.00 1.00 1.00 32
accuracy 1.00 63
macro avg 1.00 1.00 1.00 63
weighted avg 1.00 1.00 1.00 63
ss_hcc_pca_random_forest = train_test_random_forest(X_pca_ss_hcc, y_pca_ss_hcc, n_jobs = -1)
Training data dimensions: (136, 34)
Testing data dimensions: (46, 34)
========================= Training =========================
Best Parameters: {'bootstrap': False, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 50}
Best Score (CV avg): 0.9852743878956579
Number of decision trees: 50
Maximum tree depth: 10
Minimum samples to split: 2
Minimum samples per leaf: 10
Maximum features considered at each split: sqrt
Bootstrap: False
Feature importances: [0.0064334 0.26821343 0.39898388 0.04399438 0.0053163 0.01452682
0.01100844 0.01197652 0.01214636 0.02534943 0.0084228 0.02346837
0.00617053 0.01136463 0.01097255 0.00409977 0.00640453 0.00222945
0.01518255 0.00390206 0.00612587 0.00965581 0.00137775 0.00401086
0.00990919 0.01525441 0.00095232 0.00268207 0.01150917 0.00975953
0.00677455 0.00704284 0.01695411 0.0078253 ]
Training accuracy: 0.9926147162639153
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 23 2
Actual Norm 1 20
Accuracy: 0.9347826086956522
Classification report:
precision recall f1-score support
Hypo 0.96 0.92 0.94 25
Norm 0.91 0.95 0.93 21
accuracy 0.93 46
macro avg 0.93 0.94 0.93 46
weighted avg 0.94 0.93 0.93 46
ds_mcf7_pca_random_forest = train_test_random_forest(X_pca_ds_mcf7, y_pca_ds_mcf7, n_jobs = -1)
Training data dimensions: (16219, 761)
Testing data dimensions: (5407, 761)
========================= Training =========================
Best Parameters: {'n_estimators': 600, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'class_weight': 'balanced', 'bootstrap': True}
Best Score (CV avg): 0.9246355565116522
Number of decision trees: 600
Maximum tree depth: 10
Minimum samples to split: 5
Minimum samples per leaf: 2
Maximum features considered at each split: sqrt
Bootstrap: True
Feature importances: [0.10531168 0.1298405 0.14376609 0.00890776 0.02425938 0.02095302
0.002667 0.00859892 0.00154817 0.00104473 0.00099024 0.00243691
0.00079045 0.00100409 0.00238867 0.00420052 0.00227854 0.00758007
0.00386235 0.00065713 0.00139259 0.00101725 0.00311133 0.00083539
0.00550909 0.00690945 0.00146923 0.00197005 0.00126192 0.00069935
0.00125397 0.00175273 0.00311906 0.00106822 0.00351382 0.01149722
0.01603425 0.00457864 0.000856 0.00063889 0.00058418 0.00055282
0.00046926 0.0009799 0.00039346 0.00069561 0.00615267 0.00423996
0.00028781 0.00071115 0.00033103 0.00030638 0.00031317 0.00055128
0.00434507 0.00109667 0.00033435 0.00051367 0.00043008 0.00183359
0.00028726 0.00035108 0.0004725 0.00035709 0.00052879 0.00059977
0.00036044 0.00033627 0.00049262 0.00032868 0.0002728 0.00037134
0.00028611 0.0002421 0.00067918 0.00034513 0.00031624 0.00036998
0.00036177 0.0004618 0.00050901 0.00520248 0.00038603 0.00054757
0.00092149 0.00077227 0.00048457 0.00033451 0.00029814 0.00049587
0.00048314 0.00044835 0.00026905 0.00043941 0.0035963 0.00043558
0.00365572 0.00072697 0.000371 0.00047457 0.00056056 0.00030292
0.00054543 0.0004982 0.00063516 0.00039058 0.00041866 0.00034867
0.00034766 0.00258043 0.00035767 0.00023369 0.00102244 0.0003174
0.00036132 0.00185167 0.00068316 0.00114183 0.00100102 0.00026577
0.00137913 0.00027339 0.0004236 0.00043554 0.00026215 0.00044223
0.00055951 0.00046921 0.00024452 0.00068093 0.00025074 0.00041923
0.00326383 0.00034942 0.00096892 0.00032075 0.00023898 0.00031992
0.00033645 0.00150273 0.00028185 0.00498445 0.00044432 0.00198335
0.00487231 0.00037248 0.00274954 0.00031893 0.00445734 0.00029965
0.00489857 0.00033456 0.00027911 0.00021839 0.0004373 0.00026372
0.00479623 0.00075912 0.00048689 0.00025701 0.00052364 0.00030578
0.00061907 0.00032457 0.00040429 0.00057435 0.00302461 0.00041054
0.00026883 0.00149999 0.00043445 0.00046755 0.00031501 0.00047634
0.00042197 0.0006877 0.00033929 0.00041833 0.00048601 0.00044755
0.00034759 0.00027137 0.00045445 0.00029416 0.00028925 0.00038233
0.00046582 0.00024644 0.00030989 0.00027655 0.00123483 0.000325
0.00037917 0.00031292 0.00024639 0.00034892 0.0002873 0.00034006
0.00021666 0.00032756 0.0006794 0.00028754 0.00060304 0.00037378
0.00028748 0.00043884 0.00033982 0.00031651 0.00027366 0.0002707
0.00025703 0.00024373 0.00031308 0.00026148 0.00028144 0.00029017
0.00039625 0.00024265 0.00029547 0.00032254 0.00031326 0.00040388
0.00025408 0.00040568 0.00024113 0.0003202 0.00037481 0.00039226
0.00027595 0.00036615 0.00035766 0.00145637 0.00032405 0.00037314
0.00077107 0.00147386 0.00029295 0.00075031 0.00070221 0.00040906
0.00073319 0.00092325 0.00036494 0.00039274 0.00025713 0.00043654
0.00095813 0.00038295 0.00042241 0.00026091 0.0003436 0.00052236
0.00037572 0.00038373 0.00030407 0.00029102 0.00030134 0.00042555
0.00032727 0.00048261 0.0004383 0.000307 0.00022795 0.00083452
0.00031865 0.00029358 0.0007342 0.00027078 0.0002773 0.00043468
0.00123634 0.00019869 0.00035141 0.00100078 0.0006563 0.00043935
0.00055915 0.00398463 0.00078137 0.000493 0.00060457 0.00047938
0.0003374 0.00050419 0.00044541 0.0005015 0.00051325 0.00047102
0.00031658 0.00083919 0.00044009 0.00041698 0.00031938 0.00127632
0.00079488 0.00043122 0.00131825 0.00040851 0.00048043 0.00139255
0.0019101 0.00027901 0.00165899 0.00041956 0.00047509 0.00078968
0.00197831 0.00051353 0.00035162 0.00062458 0.00083784 0.00059051
0.00071212 0.0035567 0.00031613 0.00063115 0.00450602 0.00317074
0.00054999 0.00177576 0.00451337 0.0027391 0.0005018 0.00072116
0.00030761 0.00345046 0.0011965 0.00227663 0.00029977 0.00043484
0.0013455 0.00086135 0.00040533 0.00088912 0.00684443 0.00053045
0.00072352 0.00209918 0.00051302 0.00042435 0.00038161 0.00067255
0.00041105 0.00256786 0.00094846 0.0006198 0.00044308 0.00094528
0.00051006 0.00064724 0.00043295 0.00118618 0.00051426 0.00029167
0.00039577 0.00067263 0.00071725 0.00150253 0.000357 0.00024314
0.00058133 0.0003918 0.00124965 0.00130424 0.00034879 0.00068637
0.00142809 0.00066068 0.00046248 0.00057926 0.00100908 0.00081208
0.00062834 0.00099082 0.00125176 0.00116258 0.00157917 0.00120138
0.00178949 0.00209933 0.00294813 0.00061149 0.00185401 0.00039849
... (remaining feature importance values truncated for brevity)]
Training accuracy: 0.9751030814126183
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2011 219
Actual Norm 136 3041
Accuracy: 0.9343443684113186
Classification report:
precision recall f1-score support
Hypo 0.94 0.90 0.92 2230
Norm 0.93 0.96 0.94 3177
accuracy 0.93 5407
macro avg 0.93 0.93 0.93 5407
weighted avg 0.93 0.93 0.93 5407
ds_hcc_pca_random_forest = train_test_random_forest(X_pca_ds_hcc, y_pca_ds_hcc, n_jobs = -1)
Training data dimensions: (11011, 844)
Testing data dimensions: (3671, 844)
========================= Training =========================
Best Parameters: {'n_estimators': 400, 'min_samples_split': 5, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'max_depth': 10, 'class_weight': 'balanced', 'bootstrap': True}
Best Score (CV avg): 0.8894451495995834
Number of decision trees: 400
Maximum tree depth: 10
Minimum samples to split: 5
Minimum samples per leaf: 5
Maximum features considered at each split: sqrt
Bootstrap: True
Feature importances: [0.01556211 0.01332938 0.02437381 0.00887438 0.05431453 0.15896834
 ... (remaining 838 values truncated for brevity)]
Training accuracy: 0.9760275073971344
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2045 180
Actual Norm 202 1244
Accuracy: 0.8959411604467448
Classification report:
precision recall f1-score support
Hypo 0.91 0.92 0.91 2225
Norm 0.87 0.86 0.87 1446
accuracy 0.90 3671
macro avg 0.89 0.89 0.89 3671
weighted avg 0.90 0.90 0.90 3671
Multilayer perceptron¶
The Multi-layer Perceptron (MLP) is a feedforward neural network trained via backpropagation. It consists of one or more fully connected hidden layers with non-linear activation functions.
- Loss: Cross-entropy
- Hidden layer activation: ReLU
- Output activation: Sigmoid for binary classification
MLPs may outperform traditional ML models when:
- There are complex nonlinear relationships that tree models or SVMs cannot easily capture
- The data is large
- There are enough training examples to avoid overfitting (since MLPs have many parameters)
Because of the last point, although we train MLPs on both the Smart-seq and Drop-seq datasets, meaningful results are expected only on Drop-seq, which is much larger than Smart-seq. The models trained on Smart-seq serve purely as a point of comparison later on.
Key Hyperparameters¶
- hidden_layer_sizes: tuple specifying the number of neurons in each hidden layer.
  - Example: (100, 50) means two hidden layers, the first with 100 neurons and the second with 50.
- alpha: L2 regularization parameter that penalizes large weights to prevent overfitting.
  - A larger $\alpha$ increases regularization, encouraging the model to use smaller weights.
The regularized loss function becomes:
$$ L_{\text{total}} = L_{\text{data}} + \alpha \sum_i ||W^{(i)}||^2 $$
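As a quick sanity check of the penalty term, here is a minimal numpy sketch of the regularized loss above (the weight matrices and data-loss value are illustrative only, not taken from the trained models):

```python
import numpy as np

def l2_penalized_loss(data_loss, weight_matrices, alpha):
    """Add alpha times the sum of squared weights (over all layers) to the data loss."""
    penalty = sum(np.sum(W ** 2) for W in weight_matrices)
    return data_loss + alpha * penalty

# Two toy layers with unit weights: 2*3 + 3*1 = 9 squared weights in total
W1, W2 = np.ones((2, 3)), np.ones((3, 1))
total = l2_penalized_loss(0.5, [W1, W2], alpha=0.1)
print(round(total, 6))  # 0.5 + 0.1 * 9 = 1.4
```

Increasing alpha inflates the penalty term, so the optimizer trades some data-loss reduction for smaller weights.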
Training¶
def train_mlp(
X_train,
y_train,
random_state: int | None = None,
n_jobs: int | None = None,
verbose: bool = True
):
if verbose:
print("========================= Training =========================")
params = {
"hidden_layer_sizes": [(200,), (100, 50), (100, 100), (200, 100, 50)],
"alpha": [1e-4, 1e-3, 1e-2, 1e-1], # L2 regularization strength
}
model = GridSearchCV(
estimator = MLPClassifier(
max_iter = 500,
random_state = random_state,
early_stopping = True,
n_iter_no_change = 10,
verbose = False,
),
param_grid = params,
refit = True,
cv = 5,
n_jobs = n_jobs,
return_train_score = True
)
model.fit(X_train, y_train)
if verbose:
summarize_crossvalidation(model)
print("Training accuracy:", model.score(X_train, y_train))
plot_learning_curve(model, list(params.keys()), log_scale_params=["alpha"])
return model.best_estimator_
Evaluation¶
def train_test_mlp(
X,
y,
test_size: float = 0.25,
train_size: float | None = None,
random_state: int = 10,
n_jobs: int | None = None,
verbose: bool = True
):
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y
)
if verbose:
print("Training data dimensions:", X_train.shape)
print("Testing data dimensions:", X_test.shape)
model = train_mlp(X_train = X_train, y_train = y_train, random_state = random_state, n_jobs = n_jobs, verbose = verbose)
accuracy = test_model(model = model, X_test = X_test, y_test = y_test, verbose = verbose)
return TrainedModelWrapper(
model = model,
X = X,
y = y,
X_train = X_train,
y_train = y_train,
X_test = X_test,
y_test = y_test,
accuracy = accuracy
)
ss_mcf7_pca_mlp = train_test_mlp(X_pca_ss_mcf7, y_pca_ss_mcf7, n_jobs = -1)
Training data dimensions: (187, 20)
Testing data dimensions: (63, 20)
========================= Training =========================
Best Parameters: {'alpha': 0.0001, 'hidden_layer_sizes': (100, 100)}
Best Score (CV avg): 0.9677098150782362
Max Iterations: 500
Number of iterations for convergence: 14
Training accuracy: 0.9893048128342246
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 31 0
Actual Norm 0 32
Accuracy: 1.0
Classification report:
precision recall f1-score support
Hypo 1.00 1.00 1.00 31
Norm 1.00 1.00 1.00 32
accuracy 1.00 63
macro avg 1.00 1.00 1.00 63
weighted avg 1.00 1.00 1.00 63
ss_hcc_pca_mlp = train_test_mlp(X_pca_ss_hcc, y_pca_ss_hcc, n_jobs = -1)
Training data dimensions: (136, 34)
Testing data dimensions: (46, 34)
========================= Training =========================
Best Parameters: {'alpha': 0.0001, 'hidden_layer_sizes': (100, 100)}
Best Score (CV avg): 0.8968253968253969
Max Iterations: 500
Number of iterations for convergence: 22
Training accuracy: 0.9705882352941176
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 24 1
Actual Norm 1 20
Accuracy: 0.9565217391304348
Classification report:
precision recall f1-score support
Hypo 0.96 0.96 0.96 25
Norm 0.95 0.95 0.95 21
accuracy 0.96 46
macro avg 0.96 0.96 0.96 46
weighted avg 0.96 0.96 0.96 46
Despite the high test accuracies, the Smart-seq test sets contain only 63 and 46 samples, so these estimates carry little statistical weight; as anticipated, the much larger Drop-seq datasets are needed for reliable conclusions.
ds_mcf7_pca_mlp = train_test_mlp(X_pca_ds_mcf7, y_pca_ds_mcf7, n_jobs = -1)
Training data dimensions: (16219, 761)
Testing data dimensions: (5407, 761)
========================= Training =========================
Best Parameters: {'alpha': 0.1, 'hidden_layer_sizes': (200,)}
Best Score (CV avg): 0.9808249428818134
Max Iterations: 500
Number of iterations for convergence: 13
Training accuracy: 0.990381651149886
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2157 73
Actual Norm 28 3149
Accuracy: 0.9813205104494174
Classification report:
precision recall f1-score support
Hypo 0.99 0.97 0.98 2230
Norm 0.98 0.99 0.98 3177
accuracy 0.98 5407
macro avg 0.98 0.98 0.98 5407
weighted avg 0.98 0.98 0.98 5407
ds_hcc_pca_mlp = train_test_mlp(X_pca_ds_hcc, y_pca_ds_hcc, n_jobs = -1)
Training data dimensions: (11011, 844)
Testing data dimensions: (3671, 844)
========================= Training =========================
Best Parameters: {'alpha': 0.1, 'hidden_layer_sizes': (100, 100)}
Best Score (CV avg): 0.9560442102112428
Max Iterations: 500
Number of iterations for convergence: 39
Training accuracy: 0.9964580873671782
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2119 106
Actual Norm 61 1385
Accuracy: 0.954508308362844
Classification report:
precision recall f1-score support
Hypo 0.97 0.95 0.96 2225
Norm 0.93 0.96 0.94 1446
accuracy 0.95 3671
macro avg 0.95 0.96 0.95 3671
weighted avg 0.96 0.95 0.95 3671
Feature selection¶
Feature selection methods can reduce the number of dimensions without transforming the data (i.e., as an alternative to PCA), preserving the interpretability of each gene before a model is trained. Feature selection can also reduce noise and improve the generalizability of the model, and it can be combined with PCA to identify the principal components that are most important for classification.
X_ss_mcf7 = ss_mcf7_norm.T.iloc[:]
y_ss_mcf7 = ["Hypo" if "hypo" in name.lower() else "Norm" for name in ss_mcf7_norm.columns]
X_ss_hcc = ss_hcc_norm.T.iloc[:]
y_ss_hcc = ["Hypo" if "hypo" in name.lower() else "Norm" for name in ss_hcc_norm.columns]
X_ds_mcf7 = ds_mcf7_norm.T.iloc[:]
y_ds_mcf7 = ["Hypo" if "hypo" in name.lower() else "Norm" for name in ds_mcf7_norm.columns]
X_ds_hcc = ds_hcc_norm.T.iloc[:]
y_ds_hcc = ["Hypo" if "hypo" in name.lower() else "Norm" for name in ds_hcc_norm.columns]
Feature selection functions¶
def get_selected_features(
pipeline: Pipeline,
X_train,
step_names: list[str]
) -> list[str]:
feature_names = X_train.columns
for name in step_names:
selector = pipeline.named_steps[name]
mask = selector.get_support()
feature_names = feature_names[mask]
return feature_names.to_list()
def get_selected_pcs_from_model(estimator: BaseEstimator, verbose: bool = True):
selector = SelectFromModel(estimator, prefit = True)
mask = selector.get_support()
pcs = [i + 1 for i in range(len(mask)) if mask[i]]
if verbose:
print(f"Top {len(pcs)} principal components:")
print(pcs)
return pcs
def count_and_sort_occurrences(feature_lists: list[list[str]], verbose: bool = True):
top_features = []
for feature_list in feature_lists:
top_features += feature_list
top_features = np.array(top_features)
unique_features, feature_counts = np.unique(top_features, return_counts = True)
top_features = np.asarray((unique_features, feature_counts)).T
top_features = top_features[top_features[:, 1].argsort()][::-1]
if verbose:
print("Feature | Occurrences")
print(top_features)
return top_features
def filter_by_occurrences(feature_list: np.ndarray, n_occurrences: int):
return [feature[0].item() for feature in feature_list if int(feature[1]) == n_occurrences]
Recursive feature elimination (RFE) with grid-search cross-validation provides comprehensive feature selection, yielding a compact and interpretable set of genes. However, it is very computationally intensive. To reduce training time, a SelectKBest selector first uses the ANOVA F-test to keep only the k best features. This set is further reduced by fitting a linear SVM and keeping the features with the largest model weights. SVM-based pre-selection works well for linear downstream models such as SVM and logistic regression.
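The univariate step relies on the one-way ANOVA F-test. As a self-contained sketch of the statistic that scanpy/sklearn's `f_classif` computes (on synthetic data, not the gene matrices):

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F-statistic for each feature column of X, given labels y."""
    classes = np.unique(y)
    groups = [X[y == c] for c in classes]
    n, k = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    # Between-group variability: how far each class mean sits from the grand mean
    ss_between = sum(len(g) * (g.mean(axis=0) - grand_mean) ** 2 for g in groups)
    # Within-group variability: spread of samples around their own class mean
    ss_within = sum(((g - g.mean(axis=0)) ** 2).sum(axis=0) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = np.repeat(["Hypo", "Norm"], 50)
X[:, 3] += 2.0 * (y == "Norm")   # make feature 3 separate the two classes
scores = anova_f_scores(X, y)
print(int(np.argmax(scores)))    # feature 3 has by far the largest F-score
```

Features whose class means differ strongly relative to their within-class spread get large F-scores, which is exactly what SelectKBest ranks on.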
def train_feature_selection_svm_rfe(
X_train,
y_train,
estimator: BaseEstimator,
estimator_params: dict[str, list],
k: int | None = 1_000,
random_state: int | None = None,
n_jobs: int | None = None,
verbose: bool = True
):
"""Train a feature selection pipeline using ANOVA, SVM, and RFE.
Returns:
tuple[Pipeline, list[str]]: Best pipeline and list of selected features.
"""
if verbose:
print("========================= Training =========================")
n_samples = X_train.shape[0]
params = {f"estimator__{param}": options for param, options in estimator_params.items()}
if hasattr(estimator, "random_state") and random_state is not None:
estimator.set_params(random_state = random_state)
if hasattr(estimator, "n_jobs") and n_jobs is not None:
estimator.set_params(n_jobs = n_jobs)
univariate_selector = SelectKBest(k = k)
svm_selector = SelectFromModel(LinearSVC(C = 0.025, random_state = random_state, max_iter = 10_000))
rfe_selector = RFECV(estimator)
pipeline = Pipeline([
("univariate", univariate_selector),
("svm", svm_selector),
("rfe", rfe_selector),
("estimator", estimator)
])
pipeline = GridSearchCV(
estimator = pipeline,
param_grid = params,
refit = True,
scoring = "f1_macro",
cv = 5,
n_jobs = n_jobs,
return_train_score = True
) if n_samples < 10_000 else RandomizedSearchCV(
estimator = pipeline,
param_distributions = params,
random_state = random_state,
refit = True,
scoring = "f1_macro",
cv = 5,
n_jobs = n_jobs,
return_train_score = True,
)
pipeline.fit(X_train, y_train)
summarize_crossvalidation(pipeline)
print("Training accuracy:", pipeline.score(X_train, y_train))
plot_learning_curve(pipeline, list(params.keys()))
best_pipeline: Pipeline = pipeline.best_estimator_
selected_features = get_selected_features(best_pipeline, X_train, ["univariate", "svm", "rfe"])
if verbose:
print("Number of selected genes:", len(selected_features))
print("Selected genes:", selected_features)
return best_pipeline, selected_features
def feature_selection_svm_rfe(
X,
y,
estimator: BaseEstimator,
estimator_params: dict[str, list],
test_size: float = 0.25,
train_size: float | None = None,
random_state: int | None = 10,
n_jobs: int | None = None,
verbose: bool = True
):
"""Train and test a feature selection pipeline using ANOVA, SVM, and RFE.
Returns:
tuple[Pipeline, list[str], float]: Best pipeline, list of selected features, test accuracy.
"""
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y)
if verbose:
print("Training data dimensions:", X_train.shape)
print("Testing data dimensions:", X_test.shape)
pipeline, selected_features = train_feature_selection_svm_rfe(
X_train = X_train,
y_train = y_train,
estimator = estimator,
estimator_params = estimator_params,
random_state = random_state,
n_jobs = n_jobs,
verbose = verbose
)
accuracy = test_model(pipeline, X_test, y_test, verbose)
return pipeline, selected_features, accuracy
Since random forest is not a linear model like SVM or logistic regression, an SVM pre-selector is a poor fit: it favors features with strong linear effects and may discard the non-linear and interaction effects that a random forest can exploit. A random-forest-based selector is therefore used instead.
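A toy illustration of this point, assuming scikit-learn is available (synthetic data, not the gene matrices): a purely interactive, XOR-like pair of features carries no marginal linear signal on its own, yet a random forest still assigns it the highest impurity importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# The label is the XOR of two binary features, so neither feature predicts
# the label on its own; only their interaction does.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=400)
x1 = rng.integers(0, 2, size=400)
X = np.column_stack([x0, x1, rng.normal(size=(400, 3))])  # plus 3 noise features
y = x0 ^ x1

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_two = set(np.argsort(rf.feature_importances_)[-2:].tolist())
print(sorted(top_two))  # the two interaction features dominate the importances
```

A linear pre-selector scoring each feature marginally would see little to keep here, which is why the random-forest pipeline drops the SVM step.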
def train_feature_selection_random_forest(
X_train,
y_train,
estimator: BaseEstimator,
estimator_params: dict[str, list],
k: int | None = 500,
random_state: int | None = None,
n_jobs: int | None = None,
verbose: bool = True
):
"""Train a feature selection pipeline using ANOVA and random forest.
Returns:
tuple[Pipeline, list[str]]: Best pipeline and list of selected features.
"""
if verbose:
print("========================= Training =========================")
n_samples = X_train.shape[0]
params = {f"estimator__{param}": options for param, options in estimator_params.items()}
if hasattr(estimator, "random_state") and random_state is not None:
estimator.set_params(random_state = random_state)
if hasattr(estimator, "n_jobs") and n_jobs is not None:
estimator.set_params(n_jobs = n_jobs)
univariate_selector = SelectKBest(k = k)
random_forest_selector = SelectFromModel(RandomForestClassifier())
pipeline = Pipeline([
("univariate", univariate_selector),
("random_forest", random_forest_selector),
("estimator", estimator)
])
pipeline = GridSearchCV(
estimator = pipeline,
param_grid = params,
refit = True,
scoring = "f1_macro",
cv = 5,
n_jobs = n_jobs,
return_train_score = True
) if n_samples < 10_000 else RandomizedSearchCV(
estimator = pipeline,
param_distributions = params,
random_state = random_state,
refit = True,
scoring = "f1_macro",
cv = 5,
n_jobs = n_jobs,
return_train_score = True,
)
pipeline.fit(X_train, y_train)
summarize_crossvalidation(pipeline)
print("Training accuracy:", pipeline.score(X_train, y_train))
plot_learning_curve(pipeline, list(params.keys()))
best_pipeline: Pipeline = pipeline.best_estimator_
selected_features = get_selected_features(best_pipeline, X_train, ["univariate", "random_forest"])
if verbose:
print("Number of selected genes:", len(selected_features))
print("Selected genes:", selected_features)
return best_pipeline, selected_features
def feature_selection_random_forest(
X,
y,
estimator: BaseEstimator,
estimator_params: dict[str, list],
test_size: float = 0.25,
train_size: float | None = None,
random_state: int | None = 10,
n_jobs: int | None = None,
verbose: bool = True
):
"""Train and test a feature selection pipeline using ANOVA, SVM, and RFE.
Returns:
tuple[Pipeline, list[str], float]: Best pipeline, list of selected features, test accuracy.
"""
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, train_size = train_size, random_state = random_state, stratify = y)
if verbose:
print("Training data dimensions:", X_train.shape)
print("Testing data dimensions:", X_test.shape)
pipeline, selected_features = train_feature_selection_random_forest(
X_train = X_train,
y_train = y_train,
estimator = estimator,
estimator_params = estimator_params,
random_state = random_state,
n_jobs = n_jobs,
verbose = verbose
)
accuracy = test_model(pipeline, X_test, y_test, verbose)
return pipeline, selected_features, accuracy
Logistic regression¶
Use the feature selection pipeline to select genes from the raw data.
ss_mcf7_logit, ss_mcf7_logit_features, ss_mcf7_logit_accuracy = feature_selection_svm_rfe(
X = X_ss_mcf7,
y = y_ss_mcf7,
estimator = LogisticRegression(max_iter = 10_000),
estimator_params = {"C": [0.1, 1, 2]},
n_jobs = -1
)
Training data dimensions: (187, 3000)
Testing data dimensions: (63, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 1}
Best Score (CV avg): 0.9893277893277894
Training accuracy: 1.0
Number of selected genes: 24
Selected genes: ['CYP1B1', 'DDIT4', 'TUBA1B', 'GFRA1', 'MT-CYB', 'SLC9A3R1', 'XBP1', 'MT-CO3', 'EMP2', 'MT-CO2', 'SLC39A6', 'PGK1', 'LDHA', 'STARD10', 'MT-CO1', 'SCD', 'FLNA', 'MT-ATP6', 'DHCR7', 'SULF2', 'GATA3', 'DDX5', 'NME1-NME2', 'ALDOA']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 31 0
Actual Norm 0 32
Accuracy: 1.0
Classification report:
precision recall f1-score support
Hypo 1.00 1.00 1.00 31
Norm 1.00 1.00 1.00 32
accuracy 1.00 63
macro avg 1.00 1.00 1.00 63
weighted avg 1.00 1.00 1.00 63
ss_hcc_logit, ss_hcc_logit_features, ss_hcc_logit_accuracy = feature_selection_svm_rfe(
X = X_ss_hcc,
y = y_ss_hcc,
estimator = LogisticRegression(max_iter = 10_000),
estimator_params = {"C": [0.1, 1, 2]},
n_jobs = -1
)
Training data dimensions: (136, 3000)
Testing data dimensions: (46, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.1}
Best Score (CV avg): 0.9775925925925926
Training accuracy: 1.0
Number of selected genes: 166
Selected genes: ['DDIT4', 'ANGPTL4', 'CCNB1', 'IGFBP3', 'AKR1C2', 'NDRG1', 'KRT4', 'FN1', 'MMP1', 'SPP1', 'EGLN3', 'CA9', 'CDC20', 'AURKA', 'PLIN2', 'UPK1B', 'AKR1C1', 'FOS', 'LAMB3', 'LY6D', 'H4C3', 'AKR1C3', 'TPX2', 'PLAU', 'CXCL1', 'FAM83A', 'BNIP3', 'INSIG1', 'KRT19', 'BHLHE40', 'TXNIP', 'THBS1', 'ALDOC', 'ID3', 'CEACAM5', 'FTH1', 'GPRC5A', 'CCNB2', 'KPNA2', 'FTL', 'PLK2', 'DKK1', 'KCTD11', 'SLC2A1', 'CLDN4', 'KIF23', 'PGK1', 'SLC6A8', 'KIF2C', 'LOXL2', 'CHAC1', 'SPAG5', 'F3', 'WTAPP1', 'CSTB', 'HSPA5', 'DHCR7', 'HERPUD1', 'FGFBP1', 'CDKN1A', 'PFKFB3', 'DHRS3', 'LDHA', 'SLCO4A1', 'KDM5B', 'KRT8', 'PRC1', 'ADM', 'KNSTRN', 'FDFT1', 'CKS2', 'TMSB10', 'SLC38A2', 'CD44', 'FOSL2', 'JUP', 'KYNU', 'ALDH1A3', 'S100A2', 'KRT18', 'ZWINT', 'PRSS23', 'HBP1', 'SQSTM1', 'MYC', 'JUNB', 'H1-0', 'C10orf55', 'MSMO1', 'ERO1A', 'SRXN1', 'CKAP2', 'TFRC', 'SEMA4B', 'ITGA6', 'EIF5', 'P4HA1', 'TRIM29', 'SLC20A1', 'TRIM16', 'CDC6', 'IRF6', 'HMGCS1', 'GPX2', 'GPI', 'HSPA8', 'ISG15', 'ALDOA', 'CAV1', 'BIRC5', 'TXN', 'TUBB', 'PCDH1', 'TUBB4B', 'MT-CO3', 'ACAT2', 'POLR2A', 'IER2', 'AMOTL2', 'FSCN1', 'MT-CYB', 'BLCAP', 'PLOD2', 'TUBA1B', 'HES1', 'NQO1', 'DCBLD2', 'HSP90B1', 'FYB1', 'UGDH', 'LMNA', 'MRNIP', 'HRH1', 'PCNA', 'PRNP', 'BNIP3L', 'TPBG', 'C4orf3', 'MT-RNR1', 'EGLN1', 'PRDX1', 'UBB', 'HMGA1', 'PSMD2', 'NUP188', 'HSP90AA1', 'NRP1', 'SRM', 'HSPH1', 'BAG3', 'MIF-AS1', 'MIF', 'HLA-A', 'LDHB', 'CDK2AP2', 'PERP', 'EIF4A2', 'PPIF', 'FUT11', 'FAM162A', 'TYSND1', 'CLDN7', 'P4HA2', 'CANX', 'NOLC1', 'VCPIP1']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 24 1
Actual Norm 0 21
Accuracy: 0.9782608695652174
Classification report:
precision recall f1-score support
Hypo 1.00 0.96 0.98 25
Norm 0.95 1.00 0.98 21
accuracy 0.98 46
macro avg 0.98 0.98 0.98 46
weighted avg 0.98 0.98 0.98 46
ds_mcf7_logit, ds_mcf7_logit_features, ds_mcf7_logit_accuracy = feature_selection_svm_rfe(
X = X_ds_mcf7,
y = y_ds_mcf7,
estimator = LogisticRegression(max_iter = 10_000),
estimator_params = {"C": [0.1, 1, 2]},
n_jobs = -1
)
Training data dimensions: (16219, 3000)
Testing data dimensions: (5407, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.1}
Best Score (CV avg): 0.9732560929550514
Training accuracy: 0.9826839329314294
Number of selected genes: 377
Selected genes: ['MT-RNR2', 'TFF1', 'MT-RNR1', 'GDF15', 'MT-CO3', 'MT-ND4', 'MT-ND3', 'MT-CYB', 'IGFBP5', 'TMSB10', 'MT-ATP6', 'MT-CO2', 'MT-TS1', 'MT-ND6', 'MT-ND2', 'MTND1P23', 'MT-ND4L', 'MT-ND5', 'MT-TN', 'MT-ND1', 'MT-TA', 'MT-TQ', 'HES1', 'MT-TM', 'LGALS1', 'TMEM64', 'MTND2P28', 'MT-TE', 'H19', 'MT-ATP8', 'H2AC12', 'NCDN', 'MT-TY', 'TOB1', 'H2AC20', 'MT-TP', 'ANKRD52', 'MT-TD', 'C16orf91', 'ATN1', 'WSB2', 'GPM6A', 'ZFP36', 'VMP1', 'TFF3', 'KMT2D', 'FGF23', 'CRTC2', 'CSK', 'PLBD2', 'ITPK1', 'PLEC', 'GOLGA4', 'PLCD3', 'PTP4A2', 'TIAM1', 'SOX4', 'BTBD9', 'H2AC11', 'CBFA2T3', 'PROSER1', 'ARF3', 'PARD6B', 'RPL13', 'TPI1', 'BTN3A2', 'GREM1', 'RNF146', 'S100A10', 'CHAC2', 'ATXN2L', 'TGFB3', 'MGRN1', 'CAPZA1', 'FAM189B', 'GSE1', 'CERS2', 'ENO1', 'SLC48A1', 'PKIB', 'RHOD', 'BLOC1S3', 'KRT19', 'RPL34', 'TCF20', 'LINC01291', 'FAM102A', 'PRRG3', 'GABPB2', 'CAMK2N1', 'VPS9D1-AS1', 'TAF13', 'INCENP', 'ZNRF1', 'NINJ1', 'ZBTB34', 'DSP', 'ZNF480', 'CALHM2', 'MSMB', 'KCNJ2', 'ZBTB20', 'TPD52L1', 'HSPH1', 'HMGA1', 'CASP8AP2', 'ZNF302', 'ELOA', 'GPATCH4', 'SNX24', 'DVL3', 'SNX27', 'YTHDF3', 'GAB2', 'PACS1', 'NLK', 'THAP1', 'KCNJ3', 'LDLRAP1', 'TRAK2', 'CAMSAP2', 'PPM1G', 'NCALD', 'LRRFIP2', 'DNAJA1', 'SMKR1', 'MAPKAPK2', 'ZNF702P', 'NACC1', 'TRIM37', 'RFK', 'FBXL16', 'TCHP', 'ISCU', 'RABEP1', 'CACNG4', 'RPSAP48', 'WWC3', 'GDAP2', 'SRCAP', 'USP32', 'FLOT2', 'MAFF', 'NCOA1', 'TWNK', 'AKAP5', 'NEDD4L', 'APOOL', 'CCDC18', 'RAB27A', 'BRPF3', 'BCAS3', 'GATAD2A', 'NSD1', 'NPM1P40', 'ANKRD40', 'ILRUN', 'PSMD14', 'STRBP', 'TPM1', 'CAV1', 'MPHOSPH9', 'ANXA6', 'PRXL2C', 'CDC25B', 'KIF14', 'PYGO2', 'ZNF688', 'KHSRP', 'BAP1', 'MDM2', 'RAB5C', 'PAQR8', 'SOS1', 'KRT80', 'SECISBP2L', 'BOLA3', 'DNAJA4', 'THRB', 'ARPP19', 'S100A11', 'FRS2', 'RGPD4-AS1', 'BRIP1', 'PRR12', 'TEDC2-AS1', 'RPL15', 'DKC1', 'C9orf78', 'NBEAL2', 'SETD3', 'FEM1A', 'SLC25A24', 'ARMC6', 'SLC13A5', 'CFAP97', 'NEDD1', 'PHLDA2', 'MARK3', 'SPATS2L', 'PAPOLA', 'MT2A', 'ZNF354A', 'SET', 'ATXN1L', 'SCYL2', 'ZNF703', 'SRFBP1', 
'UBA52', 'MGLL', 'LAD1', 'ZC3H15', 'SLC25A48', 'RAD23A', 'EIF4G2', 'HOXC13', 'PITPNA', 'TAF9B', 'LXN', 'SERINC5', 'FBRS', 'SMC5', 'RAI14', 'TRIM44', 'MYO5C', 'AKT1S1', 'TBKBP1', 'EIF2B4', 'PRR34-AS1', 'PSME4', 'PDAP1', 'ARHGAP26', 'ELP3', 'SENP6', 'DNAJC21', 'FAM104A', 'CS', 'ABL1', 'EIF3A', 'H2AX', 'MARK2', 'LCLAT1', 'S100P', 'RCC1L', 'ANKRD17', 'TMEM259', 'RAB1B', 'GAPDH', 'TMEM258', 'SSX2IP', 'PDS5A', 'FAM177A1', 'NAA10', 'CNOT9', 'PGK1', 'PKM', 'KLHL8', 'BCL3', 'PRMT6', 'CACNA1A', 'GOLGA3', 'SOCS2', 'PPP1R12B', 'DCTN1', 'C7orf50', 'ZMIZ1', 'PGAM5', 'RPL30', 'ARNTL2', 'PREX1', 'LYAR', 'PRRC2C', 'PCYT1A', 'GLE1', 'ZFC3H1', 'BMPR1B', 'RBBP6', 'ZNF764', 'RAB35', 'ENOX2', 'LMNB2', 'ZNF326', 'ARID1B', 'TIMELESS', 'PFDN4', 'LPP', 'SYNE2', 'ZRANB1', 'PLCB4', 'CBX3', 'NOL4L', 'SPRY1', 'RPS6KA6', 'CKS2', 'SMC6', 'AURKA', 'BICDL1', 'DBNDD1', 'CRNDE', 'C2orf49', 'TPM4', 'FAM111B', 'KPNA2', 'NCKAP1', 'INF2', 'CSNK2A2', 'FARP1', 'MAP3K13', 'NCOA5', 'DNAJA3', 'RRP1B', 'HCFC1', 'ACTB', 'GATA3', 'DHX38', 'TSPYL1', 'TBC1D9', 'IWS1', 'FAM50A', 'AFF1', 'WDR43', 'SHISA5', 'CLTB', 'ETF1', 'RSRC2', 'GNAQ', 'BAZ2A', 'TARS1', 'KCNQ1OT1', 'YWHAB', 'EBAG9', 'PITX1', 'KPNA4', 'UTP18', 'PSMD5', 'NFIC', 'PHF20L1', 'PATL1', 'POLB', 'TNIP2', 'ARIH1', 'KLC1', 'ZBTB7A', 'NPLOC4', 'ARFGEF1', 'TRAF3IP2', 'BMS1', 'INPP4B', 'MYH14', 'KITLG', 'ATF5', 'TBCA', 'PICALM', 'FAM13B', 'FBXL18', 'MYO10', 'TAOK3', 'BBOF1', 'CLSPN', 'PAK2', 'STRIP1', 'IFI27L2', 'LTBR', 'ESRP2', 'C6orf62', 'AAMP', 'PMEPA1', 'UBE2Q2', 'DHX37', 'SLAIN2', 'OTUD7B', 'RPL23', 'NCBP3', 'ATRX', 'CCM2', 'NOM1', 'SMIM27']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2141 89
Actual Norm 54 3123
Accuracy: 0.9735528019234326
Classification report:
precision recall f1-score support
Hypo 0.98 0.96 0.97 2230
Norm 0.97 0.98 0.98 3177
accuracy 0.97 5407
macro avg 0.97 0.97 0.97 5407
weighted avg 0.97 0.97 0.97 5407
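As a sanity check, the reported accuracy follows directly from the confusion matrix above: correct predictions sit on the diagonal, and accuracy is the diagonal sum divided by the total support.

```python
# Counts from the confusion matrix above.
hypo_correct, hypo_missed = 2141, 89    # Actual Hypo row
norm_missed, norm_correct = 54, 3123    # Actual Norm row

total = hypo_correct + hypo_missed + norm_missed + norm_correct  # 5407
accuracy = (hypo_correct + norm_correct) / total
print(round(accuracy, 4))  # 0.9736
```

The same arithmetic reproduces the per-class precision and recall in the classification report (e.g. Hypo recall = 2141 / 2230 ≈ 0.96).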
ds_hcc_logit, ds_hcc_logit_features, ds_hcc_logit_accuracy = feature_selection_svm_rfe(
X = X_ds_hcc,
y = y_ds_hcc,
estimator = LogisticRegression(max_iter = 10_000),
estimator_params = {"C": [0.1, 1, 2]},
n_jobs = -1
)
Training data dimensions: (11011, 3000)
Testing data dimensions: (3671, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.1}
Best Score (CV avg): 0.9421202743207646
Training accuracy: 0.9611172478461597
Number of selected genes: 399
Selected genes: ['BCYRN1', 'IGFBP3', 'H2AC11', 'RAPGEF3', 'DDIT4', 'MB', 'MT-TV', 'ADAP1', 'MT-TL1', 'NDRG1', 'MIR210HG', 'MT-TQ', 'SPN', 'ZNF263', 'MT-CO3', 'CACHD1', 'BTBD9', 'GDPGP1', 'ARTN', 'LINC01304', 'CACNB2', 'HELQ', 'NAXD', 'GPM6A', 'MT-TS1', 'MT-TA', 'CRIP2', 'NEAT1', 'DUSP9', 'PRR5L', 'USP35', 'MT-ND1', 'MT-CYB', 'H19', 'H2AC12', 'CNR2', 'FGF23', 'DANT1', 'AKR1C2', 'TMSB10', 'EHBP1L1', 'LDHA', 'MT-ND6', 'EFNA2', 'CITED2', 'H4C5', 'CNOT6L', 'CLDN4', 'CPNE2', 'DTNB', 'GABRE', 'LINC02511', 'MT-ATP6', 'LGALS1', 'NUPR2', 'H2AC16', 'MPDU1', 'ATXN2L', 'PGAM1', 'PPIL1', 'NCL', 'COL6A3', 'ABO', 'RPL17', 'SLC2A1', 'C4orf3', 'NOP10', 'KLC2', 'MT-CO2', 'NCALD', 'EGLN3', 'OPTN', 'ANKRD9', 'TRAK1', 'CNNM2', 'RRAS', 'BNIP3', 'ENTR1', 'FGF8', 'B4GALT1', 'GPI', 'LIMCH1', 'MIR663AHG', 'SREK1IP1P1', 'FBXL17', 'LINC02541', 'P4HA1', 'H2BC4', 'RPL41', 'COX8A', 'MT-TS2', 'PROSER1', 'PGK1', 'CAMK2N1', 'HEPACAM', 'MSR1', 'GDI1', 'SIGMAR1', 'AHNAK2', 'GCAT', 'SINHCAFP3', 'CTXN1', 'LINC01133', 'POLR3GL', 'HES4', 'PDCD4', 'TNFRSF12A', 'ENKD1', 'SHOX', 'RGPD4-AS1', 'HIF3A', 'S100A10', 'APOOL', 'RTL8C', 'ARNTL', 'HMGB2', 'NEDD9', 'TMEM70', 'FASTKD5', 'DAAM1', 'HSP90AB1', 'ZBED2', 'EFNA5', 'PSMG1', 'TMSB4XP4', 'NPM1P40', 'RPL39', 'AJAP1', 'SAMD4A', 'WDR77', 'PAQR7', 'NDUFB4', 'BTN3A2', 'VIT', 'ARHGDIA', 'H3C2', 'FOSL2', 'MIXL1', 'MCM3AP', 'GJB3', 'PRRC2A', 'FSD1L', 'IVL', 'KCNJ3', 'BNIP3L', 'S100A11', 'BMPR1B', 'H2BC9', 'TNNT1', 'CEP120', 'LINC02367', 'RAB30', 'ZBED4', 'RAB11FIP4', 'RNF122', 'NEDD4L', 'RAB2B', 'RPS27', 'CSTB', 'C1orf53', 'NCK1', 'CPEB1', 'MLLT3', 'MELTF-AS1', 'TCF7L1', 'NT5C', 'MT1E', 'RPSAP48', 'TNFSF13B', 'ECH1', 'NDUFA8', 'MIOS-DT', 'KRT19', 'ZNF318', 'POLDIP2', 'VPS45', 'ZNF418', 'YTHDF3', 'MT-ND4L', 'PI4KB', 'ADARB1', 'AXL', 'CACNA1A', 'TUBB6', 'NRG4', 'NMD3', 'FAM126B', 'PHACTR1', 'TXNRD2', 'BAP1', 'HSPD1', 'PLD1', 'JAKMIP3', 'DDX23', 'RPL28', 'ANKEF1', 'RPS6KA6', 'DUSP5', 'SH3RF1', 'ARHGEF26', 'SLC6A8', 'JUN', 'OVOL1', 'APEH', 'CAVIN3', 'ZNF302', 'DCAKD', 
'ARL2', 'LINC01902', 'RBSN', 'CREB1', 'TATDN2', 'PRRG3', 'RPS21', 'ALDOC', 'MMP2', 'POLE4', 'PTGR1', 'CCDC168', 'GBP1P1', 'TSHZ2', 'IRF2BPL', 'ADM', 'CAST', 'RPS29', 'AKR1C1', 'PCDHGA10', 'RGS10', 'TGDS', 'EPHX1', 'KAT7', 'NEUROD2', 'CFAP251', 'MXRA5', 'PFKFB3', 'PLOD2', 'PPTC7', 'ING2', 'CD47', 'ZNF33B', 'KIRREL1', 'KDM3A', 'UQCC2', 'FUT11', 'MXI1', 'MED18', 'SYNJ2', 'SNHG18', 'RNF25', 'AKT1S1', 'KLLN', 'NCAM1', 'RAB12', 'PDLIM1', 'MT1X', 'DERA', 'YTHDF1', 'AMFR', 'CEP83', 'SF3B4', 'POLR3A', 'PHRF1', 'GYS1', 'SRA1', 'EPPK1', 'SYT14', 'FAM162A', 'KCNJ2', 'ARMC6', 'MKNK1', 'HSP90AA1', 'INHBA', 'FYN', 'BTBD7P1', 'CENPB', 'RHBDD2', 'SNX22', 'SLC2A6', 'LINC01116', 'ISOC2', 'MPHOSPH6', 'JUND', 'RAB3GAP1', 'MNS1', 'DTYMK', 'TOLLIP', 'GIN1', 'FAH', 'GOLGA4', 'TMEM256', 'DGKD', 'WDR43', 'CAMSAP2', 'NACA4P', 'ARHGAP42', 'NDUFC1', 'GAPDH', 'TMEM238', 'GRK2', 'DNAH11', 'ZBTB2', 'TRIM44', 'CIAO2A', 'UTP3', 'CALM2', 'BRMS1', 'PCDHB1', 'TTL', 'FOSL1', 'YKT6', 'ACSL4', 'CCDC34', 'SAT2', 'RHOT2', 'MAD2L1', 'DBT', 'RPL27A', 'RPL37A', 'NUP93', 'AMOTL2', 'PPP4R2', 'CARM1', 'VEGFB', 'NCLN', 'MLLT6', 'MAP2K3', 'DNAAF5', 'PSMA7', 'DDX54', 'TCEAL9', 'RPLP0P2', 'KRT4', 'SNORD3B-1', 'FEM1A', 'TRIM52-AS1', 'MCM4', 'CCNG2', 'YWHAZ', 'ARID5B', 'MRPL55', 'KMT2D', 'SPG21', 'ZC3H15', 'EMP2', 'LETM1', 'EIF3J', 'SNRNP70', 'RHOD', 'MAFF', 'MAZ', 'UQCR11', 'PLCE1', 'CPTP', 'ARHGEF7', 'STMN1', 'ZNF202', 'SNHG9', 'HMGA1', 'CLIC1', 'ZHX1', 'TSPO', 'TPD52L1', 'FRY', 'DNMT3A', 'ARL13B', 'SMARCB1', 'RRS1', 'HEY1', 'SLC25A48', 'TMEM80', 'DYSF', 'MTA2', 'C19orf53', 'ARSA', 'DGKZ', 'VRK3', 'UIMC1', 'PSIP1', 'ZNF688', 'CMIP', 'PPIG', 'EXOC7', 'TAF15', 'MARCKS', 'AK4', 'KIF5B', 'ATP5F1E', 'IRAK1', 'BRAT1', 'TSR1', 'SART1', 'CAP1', 'SETD2', 'METTL26', 'STC2', 'DDIT3', 'KEAP1', 'DLD', 'CLIP2']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2115 110
Actual Norm 94 1352
Accuracy: 0.944429310814492
Classification report:
precision recall f1-score support
Hypo 0.96 0.95 0.95 2225
Norm 0.92 0.93 0.93 1446
accuracy 0.94 3671
macro avg 0.94 0.94 0.94 3671
weighted avg 0.94 0.94 0.94 3671
Select the top principal components from the models trained on PCA-encoded data.
print("SmartSeq MCF7 Logistic Regression")
ss_mcf7_pca_logit_pcs = get_selected_pcs_from_model(ss_mcf7_pca_logit.model)
print()
print("SmartSeq HCC Logistic Regression")
ss_hcc_pca_logit_pcs = get_selected_pcs_from_model(ss_hcc_pca_logit.model)
print()
print("DropSeq MCF7 Logistic Regression")
ds_mcf7_pca_logit_pcs = get_selected_pcs_from_model(ds_mcf7_pca_logit.model)
print()
print("DropSeq HCC Logistic Regression")
ds_hcc_pca_logit_pcs = get_selected_pcs_from_model(ds_hcc_pca_logit.model)
print()
SmartSeq MCF7 Logistic Regression
Top 9 principal components: [1, 3, 6, 8, 12, 15, 16, 17, 18]

SmartSeq HCC Logistic Regression
Top 10 principal components: [2, 3, 9, 10, 12, 13, 16, 17, 23, 26]

DropSeq MCF7 Logistic Regression
Top 310 principal components: [1, 2, 3, 5, 6, 8, 13, 14, 15, 16, 17, 18, 19, 20, 25, 26, 27, 28, 29, 30, 31, 32, 33, 36, 37, 40, 43, 44, 45, 46, 52, 54, 55, 56, 57, 59, 60, 61, 62, 65, 66, 69, 71, 74, 81, 82, 85, 87, 88, 91, 92, 94, 95, 96, 99, 100, 104, 105, 107, 110, 112, 114, 115, 116, 118, 119, 120, 121, 127, 128, 133, 135, 138, 140, 141, 142, 145, 146, 147, 149, 153, 157, 160, 161, 167, 170, 172, 173, 175, 177, 182, 186, 188, 190, 191, 193, 195, 197, 198, 200, 201, 203, 204, 205, 206, 211, 212, 213, 218, 219, 221, 230, 231, 232, 234, 235, 236, 239, 240, 241, 247, 249, 252, 253, 254, 258, 263, 264, 265, 267, 269, 271, 273, 275, 279, 281, 282, 286, 287, 291, 293, 302, 305, 312, 317, 318, 319, 322, 323, 327, 332, 337, 339, 341, 342, 344, 348, 350, 352, 353, 361, 364, 365, 370, 371, 375, 376, 377, 380, 381, 383, 385, 387, 389, 391, 392, 393, 398, 399, 400, 401, 402, 403, 406, 408, 409, 411, 415, 418, 419, 420, 426, 427, 429, 430, 431, 433, 434, 435, 436, 437, 438, 442, 446, 449, 455, 459, 460, 461, 462, 464, 466, 467, 469, 470, 471, 475, 481, 484, 485, 486, 487, 491, 494, 495, 496, 497, 499, 504, 506, 507, 508, 510, 512, 514, 517, 520, 522, 527, 534, 538, 540, 541, 543, 546, 552, 555, 556, 557, 564, 565, 576, 577, 580, 582, 585, 591, 592, 596, 597, 598, 599, 602, 606, 610, 612, 615, 621, 623, 624, 626, 631, 632, 633, 642, 646, 647, 650, 652, 653, 655, 658, 661, 666, 672, 674, 675, 677, 681, 682, 685, 696, 698, 700, 702, 718, 722, 724, 726, 732, 733, 734, 741, 742, 746, 751, 754, 755, 756, 758]

DropSeq HCC Logistic Regression
Top 320 principal components: [2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 15, 16, 18, 19, 20, 21, 23, 24, 26, 27, 29, 30, 31, 32, 34, 36, 37, 38, 39, 41, 45, 46, 47, 48, 49, 53, 54, 55, 60, 63, 65, 69, 72, 76, 77, 80, 86, 88, 89, 90, 92, 94, 95, 96, 97, 99, 102, 103, 106, 115, 117, 118, 120, 121, 123, 124, 125, 126, 127, 131, 135, 136, 137, 139, 140, 141, 142, 143, 145, 147, 148, 151, 152, 153, 154, 155, 157, 159, 161, 162, 166, 167, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 187, 189, 190, 191, 193, 197, 198, 199, 200, 201, 205, 207, 208, 210, 213, 215, 217, 218, 219, 220, 221, 224, 225, 227, 229, 230, 231, 233, 234, 235, 237, 238, 239, 240, 243, 245, 247, 249, 254, 255, 257, 259, 260, 261, 262, 263, 265, 266, 267, 270, 272, 275, 282, 290, 292, 294, 295, 297, 300, 301, 303, 308, 310, 312, 313, 315, 318, 319, 325, 326, 329, 332, 334, 336, 339, 341, 344, 345, 350, 351, 356, 357, 358, 359, 360, 362, 369, 371, 372, 373, 374, 375, 377, 379, 380, 383, 384, 391, 396, 399, 401, 403, 408, 409, 412, 413, 414, 415, 420, 423, 427, 438, 439, 444, 445, 447, 448, 450, 451, 453, 457, 461, 466, 474, 479, 490, 494, 499, 503, 507, 509, 515, 516, 521, 525, 528, 540, 550, 551, 561, 562, 563, 564, 566, 567, 575, 576, 577, 581, 582, 583, 584, 586, 594, 598, 600, 603, 604, 609, 610, 612, 613, 622, 630, 637, 640, 641, 642, 651, 653, 657, 672, 691, 692, 693, 698, 700, 701, 713, 717, 718, 730, 746, 747, 749, 756, 758, 760, 761, 763, 766, 769, 777, 781, 783, 784, 785, 789, 792, 793, 794, 799, 807, 808, 809, 810, 813, 814, 831, 835, 841, 842, 843]
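`get_selected_pcs_from_model` is defined earlier in the notebook; a minimal sketch of what such a helper might do, assuming it keeps components whose absolute logistic-regression coefficient exceeds the mean absolute coefficient (the notebook's actual threshold rule may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def get_selected_pcs_from_model(model, threshold=None):
    # Keep components whose absolute coefficient exceeds the threshold
    # (default: the mean absolute coefficient across all components).
    coefs = np.abs(np.ravel(model.coef_))
    if threshold is None:
        threshold = coefs.mean()
    pcs = (np.flatnonzero(coefs > threshold) + 1).tolist()  # PCs reported 1-based
    print(f"Top {len(pcs)} principal components: {pcs}")
    return pcs

# Toy usage: 20 "principal component" scores, two of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
model = LogisticRegression(max_iter=10_000).fit(X, y)
selected = get_selected_pcs_from_model(model)
```

Because the Drop-seq models were fit on many more components, a magnitude threshold of this kind naturally retains hundreds of PCs there versus around ten for Smart-seq.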
SVM¶
Use the feature selection pipeline to select genes from the raw data.
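`feature_selection_svm_rfe` is defined earlier in the notebook; the logged `estimator__C` parameter suggests RFE with cross-validation wrapped in a grid search over the nested estimator's `C`. A minimal, hypothetical sketch on synthetic data (structure and names are assumptions, not the notebook's actual helper):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Toy stand-in for a (cells x genes) expression matrix.
X, y = make_classification(n_samples=120, n_features=40, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# RFECV recursively drops the weakest features by |coef|; GridSearchCV
# tunes the nested estimator's C (hence the "estimator__C" keys below).
selector = RFECV(LinearSVC(max_iter=10_000), step=5, cv=5, n_jobs=-1)
grid = GridSearchCV(selector, {"estimator__C": [0.025, 0.1, 1]}, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

selected_mask = grid.best_estimator_.support_  # boolean mask over genes
accuracy = grid.score(X_test, y_test)
print("Best Parameters:", grid.best_params_)
print("Number of selected genes:", int(selected_mask.sum()))
```

The small `C` values favored below act as strong regularization, which keeps the selected gene sets compact on the small Smart-seq datasets.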
ss_mcf7_svm, ss_mcf7_svm_features, ss_mcf7_svm_accuracy = feature_selection_svm_rfe(
X = X_ss_mcf7,
y = y_ss_mcf7,
estimator = LinearSVC(max_iter = 10_000),
estimator_params = {"C": [0.025, 0.1, 1, 5]},
n_jobs = -1
)
Training data dimensions: (187, 3000)
Testing data dimensions: (63, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.025}
Best Score (CV avg): 1.0
Training accuracy: 1.0
Number of selected genes: 33
Selected genes: ['DDIT4', 'NR4A1', 'FOS', 'STC2', 'HILPDA', 'MCM7', 'MT-CYB', 'TMEM64', 'XBP1', 'CRABP2', 'MT-CO3', 'EMP2', 'MT-CO2', 'PGK1', 'LDHA', 'STARD10', 'MT-CO1', 'DYNC2I2', 'FLNA', 'TMSB10', 'IFITM3', 'DSP', 'FAM162A', 'SULF2', 'QSOX1', 'ARPC1B', 'SYTL2', 'PSAP', 'CD9', 'HNRNPA2B1', 'GATA3', 'ATP9A', 'NME1-NME2']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 31 0
Actual Norm 0 32
Accuracy: 1.0
Classification report:
precision recall f1-score support
Hypo 1.00 1.00 1.00 31
Norm 1.00 1.00 1.00 32
accuracy 1.00 63
macro avg 1.00 1.00 1.00 63
weighted avg 1.00 1.00 1.00 63
ss_hcc_svm, ss_hcc_svm_features, ss_hcc_svm_accuracy = feature_selection_svm_rfe(
X = X_ss_hcc,
y = y_ss_hcc,
estimator = LinearSVC(max_iter = 10_000),
estimator_params = {"C": [0.025, 0.1, 1, 5]},
n_jobs = -1
)
Training data dimensions: (136, 3000)
Testing data dimensions: (46, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.025}
Best Score (CV avg): 0.9850189600540231
Training accuracy: 1.0
Number of selected genes: 21
Selected genes: ['DDIT4', 'ANGPTL4', 'AKR1C2', 'MMP1', 'CDC20', 'AKR1C3', 'PLAU', 'DKK1', 'PGK1', 'HSPA5', 'LDHA', 'CD44', 'HSPA8', 'ALDOA', 'CAV1', 'TXN', 'MT-CYB', 'TUBA1B', 'NQO1', 'PRDX1', 'PSMD2']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 24 1
Actual Norm 0 21
Accuracy: 0.9782608695652174
Classification report:
precision recall f1-score support
Hypo 1.00 0.96 0.98 25
Norm 0.95 1.00 0.98 21
accuracy 0.98 46
macro avg 0.98 0.98 0.98 46
weighted avg 0.98 0.98 0.98 46
ds_mcf7_svm, ds_mcf7_svm_features, ds_mcf7_svm_accuracy = feature_selection_svm_rfe(
X = X_ds_mcf7,
y = y_ds_mcf7,
estimator = LinearSVC(max_iter = 10_000),
estimator_params = {"C": [0.1, 1, 5]},
n_jobs = -1
)
Training data dimensions: (16219, 3000)
Testing data dimensions: (5407, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.1}
Best Score (CV avg): 0.9694665289633493
Training accuracy: 0.9858094450744297
Number of selected genes: 380
Selected genes: ['MT-RNR2', 'TFF1', 'MT-RNR1', 'GDF15', 'MT-CO3', 'MT-ND4', 'MT-ND3', 'MT-CYB', 'IGFBP5', 'TMSB10', 'MT-ATP6', 'MT-CO2', 'MT-TS1', 'MT-ND6', 'MT-ND2', 'MTND1P23', 'MT-ND4L', 'MT-ND5', 'MT-TN', 'MT-ND1', 'MT-TA', 'MT-TQ', 'HES1', 'MT-TM', 'LGALS1', 'TMEM64', 'MTND2P28', 'MT-TE', 'H19', 'MT-ATP8', 'H2AC12', 'NCDN', 'MT-TY', 'TOB1', 'H2AC20', 'MT-TP', 'ANKRD52', 'MT-TD', 'C16orf91', 'ATN1', 'WSB2', 'GPM6A', 'ZFP36', 'VMP1', 'TFF3', 'KMT2D', 'FGF23', 'CRTC2', 'CSK', 'PLBD2', 'ITPK1', 'PLEC', 'GOLGA4', 'PLCD3', 'PTP4A2', 'TIAM1', 'SOX4', 'BTBD9', 'H2AC11', 'CBFA2T3', 'PROSER1', 'ARF3', 'PARD6B', 'RPL13', 'TPI1', 'BTN3A2', 'GREM1', 'RNF146', 'S100A10', 'CHAC2', 'ATXN2L', 'TGFB3', 'MGRN1', 'CAPZA1', 'FAM189B', 'GSE1', 'CERS2', 'ENO1', 'SLC48A1', 'PKIB', 'RHOD', 'BLOC1S3', 'KRT19', 'RPL34', 'TCF20', 'LINC01291', 'FAM102A', 'PRRG3', 'GABPB2', 'CAMK2N1', 'VPS9D1-AS1', 'TAF13', 'INCENP', 'ZNRF1', 'NINJ1', 'ZBTB34', 'DSP', 'ZNF480', 'CALHM2', 'MSMB', 'SENP3', 'KCNJ2', 'ZBTB20', 'TPD52L1', 'HSPH1', 'HMGA1', 'CASP8AP2', 'ZNF302', 'ELOA', 'GPATCH4', 'SNX24', 'DVL3', 'SNX27', 'YTHDF3', 'GAB2', 'PACS1', 'NLK', 'THAP1', 'KCNJ3', 'LDLRAP1', 'TRAK2', 'CAMSAP2', 'PPM1G', 'HEPACAM', 'NCALD', 'LRRFIP2', 'DNAJA1', 'SMKR1', 'MAPKAPK2', 'ZNF702P', 'NACC1', 'TRIM37', 'RFK', 'FBXL16', 'TCHP', 'ISCU', 'RABEP1', 'CACNG4', 'RPSAP48', 'WWC3', 'GDAP2', 'SRCAP', 'USP32', 'FLOT2', 'MAFF', 'NCOA1', 'TWNK', 'AKAP5', 'NEDD4L', 'APOOL', 'CCDC18', 'RAB27A', 'BRPF3', 'BCAS3', 'GATAD2A', 'NSD1', 'NPM1P40', 'ANKRD40', 'ILRUN', 'PSMD14', 'STRBP', 'TPM1', 'CAV1', 'MPHOSPH9', 'ANXA6', 'PRXL2C', 'KIF14', 'PYGO2', 'ZNF688', 'KHSRP', 'BAP1', 'MDM2', 'RAB5C', 'PAQR8', 'SOS1', 'KRT80', 'SECISBP2L', 'BOLA3', 'DNAJA4', 'THRB', 'ARPP19', 'S100A11', 'FRS2', 'RGPD4-AS1', 'BRIP1', 'PRR12', 'TEDC2-AS1', 'RPL15', 'DKC1', 'C9orf78', 'NBEAL2', 'SETD3', 'FEM1A', 'SLC25A24', 'ARMC6', 'SLC13A5', 'CFAP97', 'NEDD1', 'PHLDA2', 'MARK3', 'SPATS2L', 'PAPOLA', 'MT2A', 'ZNF354A', 'SET', 'ATXN1L', 'SCYL2', 'ZNF703', 
'SRFBP1', 'UBA52', 'MGLL', 'LAD1', 'ZC3H15', 'SLC25A48', 'RAD23A', 'EIF4G2', 'HOXC13', 'PITPNA', 'TAF9B', 'LXN', 'SERINC5', 'FBRS', 'SMC5', 'RAI14', 'TRIM44', 'MYO5C', 'AKT1S1', 'TBKBP1', 'EIF2B4', 'PRR34-AS1', 'PSME4', 'PDAP1', 'ARHGAP26', 'ELP3', 'SENP6', 'DNAJC21', 'FAM104A', 'CS', 'ABL1', 'EIF3A', 'MARK2', 'LCLAT1', 'S100P', 'RCC1L', 'ANKRD17', 'TMEM259', 'CPEB4', 'RAB1B', 'GAPDH', 'TMEM258', 'SSX2IP', 'PDS5A', 'FAM177A1', 'NAA10', 'CNOT9', 'PGK1', 'PKM', 'KLHL8', 'BCL3', 'PRMT6', 'CACNA1A', 'GOLGA3', 'SOCS2', 'HPCAL1', 'PPP1R12B', 'DCTN1', 'C7orf50', 'ZMIZ1', 'PGAM5', 'RPL30', 'ARNTL2', 'PREX1', 'LYAR', 'PRRC2C', 'PCYT1A', 'GLE1', 'ZFC3H1', 'BMPR1B', 'RBBP6', 'ZNF764', 'RAB35', 'ENOX2', 'LMNB2', 'ZNF326', 'ARID1B', 'TIMELESS', 'PFDN4', 'LPP', 'SYNE2', 'ZRANB1', 'PLCB4', 'CBX3', 'NOL4L', 'SPRY1', 'RPS6KA6', 'CKS2', 'SMC6', 'AURKA', 'BICDL1', 'DBNDD1', 'CRNDE', 'C2orf49', 'TPM4', 'FAM111B', 'KPNA2', 'NCKAP1', 'INF2', 'CSNK2A2', 'FARP1', 'FGD5-AS1', 'MAP3K13', 'NCOA5', 'DNAJA3', 'RRP1B', 'HCFC1', 'ACTB', 'GATA3', 'DHX38', 'TSPYL1', 'TBC1D9', 'IWS1', 'FAM50A', 'AFF1', 'WDR43', 'SHISA5', 'CLTB', 'ETF1', 'RSRC2', 'GNAQ', 'BAZ2A', 'TARS1', 'KCNQ1OT1', 'YWHAB', 'EBAG9', 'PITX1', 'KPNA4', 'UTP18', 'PSMD5', 'NFIC', 'PHF20L1', 'PATL1', 'POLB', 'TNIP2', 'ARIH1', 'KLC1', 'ZBTB7A', 'NPLOC4', 'ARFGEF1', 'TRAF3IP2', 'BMS1', 'INPP4B', 'MYH14', 'KITLG', 'ATF5', 'TBCA', 'PICALM', 'FAM13B', 'FBXL18', 'MYO10', 'TAOK3', 'BBOF1', 'CLSPN', 'PAK2', 'STRIP1', 'IFI27L2', 'LTBR', 'BEND7', 'ESRP2', 'C6orf62', 'AAMP', 'PMEPA1', 'UBE2Q2', 'DHX37', 'SLAIN2', 'OTUD7B', 'RPL23', 'NCBP3', 'ATRX', 'NOM1', 'SMIM27']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2140 90
Actual Norm 73 3104
Accuracy: 0.969853893101535
Classification report:
precision recall f1-score support
Hypo 0.97 0.96 0.96 2230
Norm 0.97 0.98 0.97 3177
accuracy 0.97 5407
macro avg 0.97 0.97 0.97 5407
weighted avg 0.97 0.97 0.97 5407
ds_hcc_svm, ds_hcc_svm_features, ds_hcc_svm_accuracy = feature_selection_svm_rfe(
X = X_ds_hcc,
y = y_ds_hcc,
estimator = LinearSVC(max_iter = 10_000),
estimator_params = {"C": [0.025, 0.1, 1, 5]},
n_jobs = -1
)
Training data dimensions: (11011, 3000)
Testing data dimensions: (3671, 3000)
========================= Training =========================
Best Parameters: {'estimator__C': 0.025}
Best Score (CV avg): 0.9404403100550717
Training accuracy: 0.9652093585427515
Number of selected genes: 402
Selected genes: ['BCYRN1', 'IGFBP3', 'H2AC11', 'RAPGEF3', 'DDIT4', 'MB', 'MT-TV', 'MT-TL1', 'NDRG1', 'MIR210HG', 'MT-TQ', 'SPN', 'ZNF263', 'MT-CO3', 'CACHD1', 'BTBD9', 'GDPGP1', 'ARTN', 'LINC01304', 'CACNB2', 'HELQ', 'NAXD', 'GPM6A', 'MT-TS1', 'MT-TA', 'CRIP2', 'NEAT1', 'DUSP9', 'PRR5L', 'USP35', 'MT-ND1', 'MT-CYB', 'PHC1', 'H19', 'H2AC12', 'CNR2', 'FGF23', 'MUL1', 'DANT1', 'AKR1C2', 'TMSB10', 'EHBP1L1', 'LDHA', 'MT-ND6', 'EFNA2', 'CITED2', 'H4C5', 'CNOT6L', 'CLDN4', 'CPNE2', 'DTNB', 'GABRE', 'LINC02511', 'MT-ATP6', 'LGALS1', 'NUPR2', 'H2AC16', 'MPDU1', 'ATXN2L', 'PGAM1', 'PPIL1', 'NCL', 'COL6A3', 'ABO', 'RPL17', 'SLC2A1', 'C4orf3', 'NOP10', 'KLC2', 'MT-CO2', 'NCALD', 'EGLN3', 'OPTN', 'ANKRD9', 'TRAK1', 'CNNM2', 'RRAS', 'BNIP3', 'ENTR1', 'FGF8', 'B4GALT1', 'GPI', 'LIMCH1', 'MIR663AHG', 'SREK1IP1P1', 'FBXL17', 'LINC02541', 'P4HA1', 'H2BC4', 'RPL41', 'COX8A', 'MT-TS2', 'PROSER1', 'PGK1', 'CAMK2N1', 'HEPACAM', 'MSR1', 'GDI1', 'SIGMAR1', 'AHNAK2', 'GCAT', 'SINHCAFP3', 'CTXN1', 'LINC01133', 'POLR3GL', 'HES4', 'PDCD4', 'TNFRSF12A', 'ENKD1', 'SHOX', 'RGPD4-AS1', 'HIF3A', 'S100A10', 'APOOL', 'RTL8C', 'ARNTL', 'HMGB2', 'NEDD9', 'TMEM70', 'FASTKD5', 'DAAM1', 'HSP90AB1', 'ZBED2', 'EFNA5', 'PSMG1', 'TMSB4XP4', 'NPM1P40', 'RPL39', 'AJAP1', 'SAMD4A', 'WDR77', 'PAQR7', 'NDUFB4', 'BTN3A2', 'VIT', 'ARHGDIA', 'H3C2', 'FOSL2', 'MIXL1', 'MCM3AP', 'GJB3', 'PRRC2A', 'FSD1L', 'IVL', 'KCNJ3', 'BNIP3L', 'S100A11', 'BMPR1B', 'H2BC9', 'TNNT1', 'CEP120', 'LINC02367', 'RAB30', 'ZBED4', 'RAB11FIP4', 'RNF122', 'NEDD4L', 'RAB2B', 'RPS27', 'CSTB', 'C1orf53', 'NCK1', 'CPEB1', 'MLLT3', 'MELTF-AS1', 'TCF7L1', 'MT1E', 'RPSAP48', 'TNFSF13B', 'ECH1', 'NDUFA8', 'MIOS-DT', 'KRT19', 'ZNF318', 'POLDIP2', 'VPS45', 'ZNF418', 'YTHDF3', 'MT-ND4L', 'PI4KB', 'ADARB1', 'AXL', 'CACNA1A', 'TUBB6', 'NRG4', 'NMD3', 'FAM126B', 'PHACTR1', 'TXNRD2', 'BAP1', 'HSPD1', 'PLD1', 'JAKMIP3', 'DDX23', 'RPL28', 'ANKEF1', 'RPS6KA6', 'DUSP5', 'SH3RF1', 'ARHGEF26', 'SLC6A8', 'JUN', 'OVOL1', 'APEH', 'CAVIN3', 'ZNF302', 'DCAKD', 
'ARL2', 'LINC01902', 'RBSN', 'CREB1', 'TATDN2', 'PRRG3', 'RPS21', 'ALDOC', 'MMP2', 'POLE4', 'PTGR1', 'CCDC168', 'GBP1P1', 'TSHZ2', 'IRF2BPL', 'ADM', 'ZBTB20', 'CAST', 'RPS29', 'AKR1C1', 'PCDHGA10', 'RGS10', 'TGDS', 'EPHX1', 'KAT7', 'NEUROD2', 'CFAP251', 'MXRA5', 'PFKFB3', 'PLOD2', 'PPTC7', 'ING2', 'CD47', 'ZNF33B', 'KIRREL1', 'KDM3A', 'UQCC2', 'FUT11', 'MXI1', 'MED18', 'SYNJ2', 'SNHG18', 'RNF25', 'AKT1S1', 'KLLN', 'NCAM1', 'RAB12', 'PDLIM1', 'MT1X', 'DERA', 'YTHDF1', 'AMFR', 'CEP83', 'SF3B4', 'PHRF1', 'GYS1', 'SRA1', 'EPPK1', 'SYT14', 'FAM162A', 'KCNJ2', 'ARMC6', 'MKNK1', 'HSP90AA1', 'INHBA', 'SRSF8', 'FYN', 'BTBD7P1', 'CENPB', 'RHBDD2', 'SNX22', 'SLC2A6', 'LINC01116', 'ISOC2', 'MPHOSPH6', 'JUND', 'RAB3GAP1', 'MNS1', 'DTYMK', 'TOLLIP', 'GIN1', 'FAH', 'GOLGA4', 'TMEM256', 'DGKD', 'WDR43', 'CAMSAP2', 'NACA4P', 'ARHGAP42', 'NDUFC1', 'GAPDH', 'TMEM238', 'GRK2', 'DNAH11', 'ZBTB2', 'TRIM44', 'CIAO2A', 'UTP3', 'CALM2', 'BRMS1', 'PCDHB1', 'TTL', 'FOSL1', 'YKT6', 'ACSL4', 'CCDC34', 'SAT2', 'RHOT2', 'MAD2L1', 'DBT', 'RPL27A', 'RPL37A', 'NUP93', 'AMOTL2', 'PPP4R2', 'CARM1', 'VEGFB', 'NCLN', 'MLLT6', 'MAP2K3', 'DNAAF5', 'PUSL1', 'PSMA7', 'DDX54', 'TCEAL9', 'RPLP0P2', 'KRT4', 'SNORD3B-1', 'FEM1A', 'TRIM52-AS1', 'MCM4', 'CCNG2', 'YWHAZ', 'ARID5B', 'MRPL55', 'KMT2D', 'SPG21', 'ZC3H15', 'EMP2', 'LETM1', 'EIF3J', 'SNRNP70', 'RHOD', 'MAFF', 'MAZ', 'UQCR11', 'PLCE1', 'CPTP', 'ARHGEF7', 'STMN1', 'ZNF202', 'SNHG9', 'HMGA1', 'CLIC1', 'ZHX1', 'TPD52L1', 'FRY', 'DNMT3A', 'ARL13B', 'SMARCB1', 'TWNK', 'RRS1', 'HEY1', 'MRPS2', 'SLC25A48', 'TMEM80', 'DYSF', 'MTA2', 'C19orf53', 'ARSA', 'DGKZ', 'VRK3', 'UIMC1', 'PSIP1', 'ZNF688', 'CMIP', 'PPIG', 'EXOC7', 'TAF15', 'MARCKS', 'AK4', 'KIF5B', 'ATP5F1E', 'IRAK1', 'BRAT1', 'TSR1', 'SART1', 'CAP1', 'SETD2', 'METTL26', 'STC2', 'DDIT3', 'KEAP1', 'DLD', 'CLIP2']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2117 108
Actual Norm 92 1354
Accuracy: 0.9455189321710705
Classification report:
precision recall f1-score support
Hypo 0.96 0.95 0.95 2225
Norm 0.93 0.94 0.93 1446
accuracy 0.95 3671
macro avg 0.94 0.94 0.94 3671
weighted avg 0.95 0.95 0.95 3671
Select the top principal components from the models trained on PCA-encoded data.
print("SmartSeq MCF7 SVM")
ss_mcf7_pca_svm_pcs = get_selected_pcs_from_model(ss_mcf7_pca_svm.model)
print()
print("SmartSeq HCC SVM")
ss_hcc_pca_svm_pcs = get_selected_pcs_from_model(ss_hcc_pca_svm.model)
print()
print("DropSeq MCF7 SVM")
ds_mcf7_pca_svm_pcs = get_selected_pcs_from_model(ds_mcf7_pca_svm.model)
print()
print("DropSeq HCC SVM")
ds_hcc_pca_svm_pcs = get_selected_pcs_from_model(ds_hcc_pca_svm.model)
print()
SmartSeq MCF7 SVM
Top 8 principal components: [1, 3, 6, 8, 12, 16, 17, 18]

SmartSeq HCC SVM
Top 11 principal components: [2, 3, 9, 10, 12, 15, 17, 21, 26, 30, 32]

DropSeq MCF7 SVM
Top 300 principal components: [1, 2, 3, 5, 6, 8, 15, 16, 17, 18, 19, 25, 26, 27, 28, 29, 30, 31, 32, 33, 36, 37, 40, 43, 44, 45, 46, 52, 55, 56, 57, 60, 61, 62, 65, 66, 69, 71, 74, 81, 82, 85, 87, 88, 91, 92, 94, 95, 99, 100, 104, 105, 107, 110, 112, 114, 115, 116, 118, 119, 120, 121, 127, 128, 135, 138, 140, 141, 142, 145, 146, 147, 149, 153, 157, 160, 161, 167, 170, 172, 173, 175, 177, 181, 188, 190, 191, 193, 195, 198, 200, 201, 203, 204, 205, 206, 211, 212, 213, 218, 219, 230, 231, 232, 234, 235, 239, 243, 245, 247, 249, 252, 254, 255, 257, 260, 263, 264, 267, 271, 273, 275, 279, 281, 282, 286, 287, 291, 293, 302, 305, 312, 317, 318, 319, 320, 322, 323, 327, 329, 332, 339, 341, 342, 344, 348, 350, 352, 353, 355, 361, 362, 364, 370, 371, 375, 380, 383, 385, 387, 389, 391, 392, 393, 398, 399, 400, 401, 402, 406, 408, 409, 411, 415, 418, 419, 429, 431, 433, 434, 435, 436, 438, 442, 449, 455, 457, 459, 460, 461, 462, 464, 466, 469, 470, 471, 481, 483, 484, 485, 486, 487, 491, 494, 495, 496, 497, 499, 504, 507, 508, 510, 512, 515, 517, 518, 519, 520, 522, 527, 534, 538, 540, 541, 543, 546, 552, 555, 556, 557, 564, 565, 571, 576, 579, 580, 582, 585, 591, 594, 596, 597, 598, 599, 602, 603, 606, 610, 612, 615, 621, 623, 626, 627, 631, 632, 633, 642, 644, 646, 647, 650, 652, 653, 655, 658, 661, 667, 672, 674, 675, 677, 681, 682, 687, 696, 698, 700, 702, 705, 711, 718, 722, 724, 726, 729, 732, 733, 734, 736, 741, 742, 743, 745, 746, 751, 754, 755, 756, 758]

DropSeq HCC SVM
Top 343 principal components: [2, 3, 4, 5, 6, 8, 11, 12, 15, 16, 18, 19, 20, 21, 23, 24, 26, 27, 29, 30, 31, 32, 34, 36, 37, 38, 39, 41, 45, 46, 47, 48, 49, 53, 54, 55, 63, 65, 69, 72, 76, 77, 88, 89, 90, 92, 94, 96, 99, 102, 103, 106, 115, 117, 118, 120, 123, 124, 127, 131, 135, 136, 137, 139, 140, 141, 142, 143, 145, 147, 148, 153, 154, 155, 157, 161, 162, 166, 167, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 187, 189, 190, 191, 193, 197, 198, 200, 201, 205, 208, 210, 213, 215, 217, 218, 219, 220, 221, 224, 225, 227, 229, 230, 231, 234, 235, 237, 238, 239, 240, 243, 247, 249, 253, 254, 255, 259, 260, 261, 262, 263, 265, 266, 269, 270, 272, 275, 282, 290, 292, 294, 295, 297, 300, 301, 303, 305, 308, 310, 312, 313, 315, 318, 319, 323, 325, 326, 329, 332, 334, 336, 338, 339, 341, 345, 350, 351, 356, 357, 358, 359, 360, 362, 369, 371, 372, 373, 374, 375, 377, 379, 380, 382, 383, 384, 391, 393, 396, 399, 401, 402, 408, 409, 412, 413, 414, 415, 420, 427, 429, 434, 438, 439, 443, 444, 445, 447, 448, 450, 451, 452, 453, 457, 461, 474, 479, 488, 490, 494, 497, 503, 507, 509, 510, 515, 516, 521, 523, 525, 528, 534, 539, 540, 541, 543, 550, 551, 553, 561, 562, 563, 564, 565, 566, 567, 575, 576, 577, 581, 582, 583, 584, 586, 593, 594, 597, 598, 599, 600, 601, 603, 608, 609, 610, 612, 613, 622, 630, 632, 637, 640, 641, 642, 644, 646, 651, 656, 657, 667, 672, 674, 685, 689, 691, 692, 693, 698, 699, 700, 701, 705, 713, 714, 716, 717, 718, 730, 733, 746, 747, 749, 755, 756, 758, 760, 761, 763, 766, 769, 771, 777, 778, 781, 783, 784, 785, 789, 792, 793, 794, 799, 807, 808, 809, 810, 812, 813, 814, 815, 820, 828, 829, 831, 832, 835, 839, 842, 843]
Random forest¶
Use the feature selection pipeline to select genes from the raw data.
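`feature_selection_random_forest` is defined earlier in the notebook; the `estimator__` prefix in the logged parameters suggests the forest sits inside a selection wrapper. A simplified, hypothetical sketch of the presumed approach (tune the forest with a grid search, then keep genes whose impurity-based importance exceeds the mean importance; the real helper's structure and threshold may differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-in for a (cells x genes) expression matrix.
X, y = make_classification(n_samples=200, n_features=40, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 200], "max_depth": [5, 10], "class_weight": ["balanced"]},
    cv=3,
    n_jobs=-1,
)
grid.fit(X_train, y_train)

# Keep the genes the tuned forest actually relies on: those with
# above-average impurity-based importance.
importances = grid.best_estimator_.feature_importances_
selected = np.flatnonzero(importances > importances.mean())
print("Number of selected genes:", len(selected))
print("Testing accuracy:", grid.score(X_test, y_test))
```

Unlike the linear models above, impurity-based importances capture non-linear splits, which is why the forests below settle on noticeably smaller gene sets.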
ss_mcf7_random_forest, ss_mcf7_random_forest_features, ss_mcf7_random_forest_accuracy = feature_selection_random_forest(
X = X_ss_mcf7,
y = y_ss_mcf7,
estimator = RandomForestClassifier(),
estimator_params = {
"n_estimators": [100, 200],
"max_depth": [5, 10, 20],
"min_samples_split": [5, 10],
"min_samples_leaf": [2, 4, 8],
"max_features": ["sqrt", 0.5],
"bootstrap": [True],
"class_weight": ["balanced"]
},
n_jobs = -1
)
Training data dimensions: (187, 3000)
Testing data dimensions: (63, 3000)
========================= Training =========================
Best Parameters: {'estimator__bootstrap': True, 'estimator__class_weight': 'balanced', 'estimator__max_depth': 5, 'estimator__max_features': 'sqrt', 'estimator__min_samples_leaf': 2, 'estimator__min_samples_split': 5, 'estimator__n_estimators': 100}
Best Score (CV avg): 0.9945945945945945
Training accuracy: 1.0
Number of selected genes: 54
Selected genes: ['CYP1B1', 'CYP1B1-AS1', 'NDRG1', 'PFKFB3', 'HK2', 'ADM', 'VEGFA', 'BNIP3', 'PFKFB4', 'ENO2', 'MT-CYB', 'SLC9A3R1', 'UBC', 'MT-CO3', 'GPI', 'EMP2', 'MT-CO2', 'DSCAM-AS1', 'PGK1', 'MT-CO1', 'DYNC2I2', 'SLC3A2', 'IFITM3', 'ERO1A', 'DSP', 'IRF2BP2', 'TUBG1', 'MT-ATP6', 'FUT11', 'P4HA1', 'FAM162A', 'PDK1', 'BNIP3L', 'MOV10', 'IFITM2', 'PYCR3', 'FDFT1', 'PFKP', 'ACLY', 'GAPDH', 'FDPS', 'FASN', 'TST', 'APEH', 'PSME2', 'SNRNP25', 'NECTIN2', 'TUBD1', 'MTATP6P1', 'EBP', 'ALDOA', 'CYB561A3', 'ACAT2', 'SQLE']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 31 0
Actual Norm 0 32
Accuracy: 1.0
Classification report:
precision recall f1-score support
Hypo 1.00 1.00 1.00 31
Norm 1.00 1.00 1.00 32
accuracy 1.00 63
macro avg 1.00 1.00 1.00 63
weighted avg 1.00 1.00 1.00 63
ss_hcc_random_forest, ss_hcc_random_forest_features, ss_hcc_random_forest_accuracy = feature_selection_random_forest(
X = X_ss_hcc,
y = y_ss_hcc,
estimator = RandomForestClassifier(),
estimator_params = {
"n_estimators": [100, 200],
"max_depth": [5, 10, 20],
"min_samples_split": [5, 10],
"min_samples_leaf": [2, 4, 8],
"max_features": ["sqrt", 0.5],
"bootstrap": [True],
"class_weight": ["balanced"]
},
n_jobs = -1
)
Training data dimensions: (136, 3000)
Testing data dimensions: (46, 3000)
========================= Training =========================
Best Parameters: {'estimator__bootstrap': True, 'estimator__class_weight': 'balanced', 'estimator__max_depth': 5, 'estimator__max_features': 'sqrt', 'estimator__min_samples_leaf': 2, 'estimator__min_samples_split': 5, 'estimator__n_estimators': 100}
Best Score (CV avg): 0.9924263674614305
Training accuracy: 0.9926147162639153
Number of selected genes: 64
Selected genes: ['DDIT4', 'ANGPTL4', 'NDRG1', 'EGLN3', 'CA9', 'PLIN2', 'UPK1B', 'FAM83A', 'BNIP3', 'INSIG1', 'KRT19', 'BHLHE40', 'ALDOC', 'GPRC5A', 'KCTD11', 'SLC2A1', 'PGK1', 'SLC6A8', 'LOXL2', 'CDKN1A', 'PFKFB3', 'LDHA', 'ARRDC3', 'ADM', 'BUB1B', 'HILPDA', 'LBH', 'BUB1', 'FOSL2', 'KYNU', 'ASB2', 'ERO1A', 'EIF5', 'P4HA1', 'C1orf116', 'RALGDS', 'SNX33', 'MOB3A', 'GPI', 'CALB1', 'ALDOA', 'BLCAP', 'PLOD2', 'ZNF473', 'HES1', 'GYS1', 'ENO2', 'TMEM45A', 'BNIP3L', 'PLAC8', 'TPBG', 'C4orf3', 'EGLN1', 'PRSS8', 'FAM13A', 'SRM', 'HSPH1', 'MIF', 'LDHB', 'PPP1R3G', 'FUT11', 'FAM162A', 'KDM3A', 'P4HA2']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 24 1
Actual Norm 0 21
Accuracy: 0.9782608695652174
Classification report:
precision recall f1-score support
Hypo 1.00 0.96 0.98 25
Norm 0.95 1.00 0.98 21
accuracy 0.98 46
macro avg 0.98 0.98 0.98 46
weighted avg 0.98 0.98 0.98 46
ds_mcf7_random_forest, ds_mcf7_random_forest_features, ds_mcf7_random_forest_accuracy = feature_selection_random_forest(
X = X_ds_mcf7,
y = y_ds_mcf7,
estimator = RandomForestClassifier(),
estimator_params = {
"n_estimators": [200, 500, 1000],
"max_depth": [20, 50, None],
"min_samples_split": [2, 5],
"min_samples_leaf": [1, 2],
"max_features": ["sqrt", 0.5, 0.8],
"bootstrap": [True, False],
"class_weight": ["balanced", None]
},
n_jobs = -1
)
Training data dimensions: (16219, 3000)
Testing data dimensions: (5407, 3000)
========================= Training =========================
Best Parameters: {'estimator__n_estimators': 1000, 'estimator__min_samples_split': 5, 'estimator__min_samples_leaf': 2, 'estimator__max_features': 'sqrt', 'estimator__max_depth': 50, 'estimator__class_weight': 'balanced', 'estimator__bootstrap': True}
Best Score (CV avg): 0.96612717826348
Training accuracy: 0.9963072692759114
Number of selected genes: 72
Selected genes: ['MALAT1', 'MT-RNR2', 'TFF1', 'MT-RNR1', 'H4C3', 'MT-CO3', 'MT-ND4', 'MT-ND3', 'MT-CYB', 'TMSB10', 'MT-ATP6', 'MT-CO2', 'BCYRN1', 'RPS5', 'HES1', 'LGALS1', 'TMEM64', 'DSCAM-AS1', 'RPL12', 'RPS12', 'TOB1', 'RPL39', 'RPS16', 'TFF3', 'FGF23', 'RPL35', 'SOX4', 'RPS19', 'RPLP2', 'RPL36', 'PARD6B', 'RPL13', 'TPI1', 'S100A10', 'RPS28', 'FTL', 'RPL35A', 'ENO1', 'KRT19', 'RPS14', 'RPL34', 'DSP', 'UQCRQ', 'RPS15A', 'ROMO1', 'ELOB', 'KRT8', 'RPS15', 'ATP5ME', 'S100A11', 'ATP5MK', 'NDUFB2', 'RPL15', 'SNRPD2', 'RPS27', 'SET', 'UBA52', 'RPL37A', 'KRT18', 'GAPDH', 'TMEM258', 'PGK1', 'PKM', 'RPL30', 'ACTB', 'RPL11', 'HSPB1', 'RPLP1', 'SERF2', 'COX7A2', 'COX7C', 'RPL23']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2111 119
Actual Norm 54 3123
Accuracy: 0.9680044386905863
Classification report:
precision recall f1-score support
Hypo 0.98 0.95 0.96 2230
Norm 0.96 0.98 0.97 3177
accuracy 0.97 5407
macro avg 0.97 0.96 0.97 5407
weighted avg 0.97 0.97 0.97 5407
ds_hcc_random_forest, ds_hcc_random_forest_features, ds_hcc_random_forest_accuracy = feature_selection_random_forest(
X = X_ds_hcc,
y = y_ds_hcc,
estimator = RandomForestClassifier(),
estimator_params = {
"n_estimators": [200, 500, 1000],
"max_depth": [20, 50, None],
"min_samples_split": [2, 5],
"min_samples_leaf": [1, 2],
"max_features": ["sqrt", 0.5, 0.8],
"bootstrap": [True, False],
"class_weight": ["balanced", None]
},
n_jobs = -1
)
Training data dimensions: (11011, 3000)
Testing data dimensions: (3671, 3000)
========================= Training =========================
Best Parameters: {'estimator__n_estimators': 1000, 'estimator__min_samples_split': 5, 'estimator__min_samples_leaf': 2, 'estimator__max_features': 'sqrt', 'estimator__max_depth': 50, 'estimator__class_weight': 'balanced', 'estimator__bootstrap': True}
Best Score (CV avg): 0.9260927654063373
Training accuracy: 0.9988589561630253
Number of selected genes: 112
Selected genes: ['MALAT1', 'MT-RNR2', 'BCYRN1', 'IGFBP3', 'H1-3', 'H4C3', 'HSPA5', 'PLEC', 'HSP90B1', 'NDRG1', 'MT-TQ', 'BTBD9', 'ENO1', 'GPM6A', 'HNRNPA2B1', 'NEAT1', 'H2AC12', 'H1-1', 'FGF23', 'AKR1C2', 'TMSB10', 'RPS28', 'LDHA', 'RPS5', 'PDIA3', 'NCL', 'NCALD', 'EGLN3', 'CNNM2', 'BNIP3', 'B4GALT1', 'EZR', 'P4HA1', 'RPL41', 'PGK1', 'AHNAK2', 'RPS19', 'RPL35', 'S100A10', 'SERF2', 'CENPF', 'HSP90AB1', 'RPL12', 'TMSB4X', 'POLR2L', 'NPM1P40', 'RPL39', 'PKM', 'KCNJ3', 'BNIP3L', 'S100A11', 'HNRNPU', 'RPS27', 'RPLP2', 'RPLP1', 'KRT19', 'RPS2', 'TPI1', 'TPT1', 'CACNA1A', 'BAP1', 'HSPD1', 'RPL28', 'ZNF302', 'GSTP1', 'EEF2', 'PRRG3', 'MT2A', 'CAST', 'S100A6', 'RPL36', 'AKR1C1', 'DSP', 'ATAD2', 'RPSA', 'ELOB', 'RPS8', 'CBX3', 'RPL21', 'HSP90AA1', 'WDR43', 'GAPDH', 'RPS3', 'TRIM44', 'TPX2', 'CALM2', 'FOSL1', 'PTMS', 'RPL27A', 'RPL37A', 'RPL8', 'UQCRQ', 'PSMA7', 'HNRNPM', 'EEF1A1', 'YWHAZ', 'RPL37', 'RPL10', 'ZC3H15', 'EIF3J', 'STMN1', 'PABPC1', 'HMGA1', 'ATP5MG', 'DNMT1', 'RPL13', 'CAV1', 'C19orf53', 'MARCKS', 'ATP5F1E', 'RPS14', 'RAC1']
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2127 98
Actual Norm 170 1276
Accuracy: 0.9269953691092345
Classification report:
precision recall f1-score support
Hypo 0.93 0.96 0.94 2225
Norm 0.93 0.88 0.90 1446
accuracy 0.93 3671
macro avg 0.93 0.92 0.92 3671
weighted avg 0.93 0.93 0.93 3671
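The `feature_selection_random_forest` helper is defined earlier in the notebook; judging by the `estimator__`-prefixed best parameters it reports, it wraps a cross-validated hyperparameter search around the forest, then selects features from the tuned model. The following is only a minimal, hypothetical reconstruction of that flow, with placeholder data, a toy grid, and an assumed mean-importance threshold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for an expression matrix (samples x genes)
X, y = make_classification(n_samples=200, n_features=30, n_informative=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},  # toy grid
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print("Best Parameters:", search.best_params_)
print("Best Score (CV avg):", search.best_score_)

# Keep features whose impurity-based importance exceeds the mean importance
# (the notebook's actual threshold may differ)
importances = search.best_estimator_.feature_importances_
selected = np.flatnonzero(importances > importances.mean())
print("Number of selected features:", len(selected))
```

The real helper also splits off a test set and prints a confusion matrix and classification report, which are omitted here for brevity.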
Select the top principal components from the models trained on the PCA-encoded data.
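`get_selected_pcs_from_model` is defined earlier in the notebook. A minimal stdlib sketch of what such a helper might look like, under the assumption that the fitted model is a pipeline whose selector step exposes a boolean `support_` mask over the PC columns (as sklearn's `SelectFromModel` and `RFE` do after fitting):

```python
from types import SimpleNamespace

def get_selected_pcs_from_model(model):
    """Hypothetical sketch: recover which principal components a fitted
    selector step kept. Assumes model.named_steps["selector"].support_
    is a boolean mask over the PC columns."""
    support = model.named_steps["selector"].support_
    pcs = [i + 1 for i, keep in enumerate(support) if keep]  # PCs numbered from 1
    print(f"Top {len(pcs)} principal components: {pcs}")
    return pcs

# Tiny stand-in object mimicking a fitted pipeline
dummy = SimpleNamespace(named_steps={"selector": SimpleNamespace(support_=[True, False, True, True])})
get_selected_pcs_from_model(dummy)  # Top 3 principal components: [1, 3, 4]
```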
print("SmartSeq MCF7 Random Forest")
ss_mcf7_pca_random_forest_pcs = get_selected_pcs_from_model(ss_mcf7_pca_random_forest.model)
print()
print("SmartSeq HCC Random Forest")
ss_hcc_pca_random_forest_pcs = get_selected_pcs_from_model(ss_hcc_pca_random_forest.model)
print()
print("DropSeq MCF7 Random Forest")
ds_mcf7_pca_random_forest_pcs = get_selected_pcs_from_model(ds_mcf7_pca_random_forest.model)
print()
print("DropSeq HCC Random Forest")
ds_hcc_pca_random_forest_pcs = get_selected_pcs_from_model(ds_hcc_pca_random_forest.model)
print()
SmartSeq MCF7 Random Forest
Top 7 principal components: [1, 2, 3, 4, 5, 6, 9]

SmartSeq HCC Random Forest
Top 3 principal components: [2, 3, 4]

DropSeq MCF7 Random Forest
Top 82 principal components: [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 18, 19, 21, 23, 25, 26, 27, 28, 32, 33, 35, 36, 37, 38, 47, 48, 55, 60, 82, 95, 97, 110, 116, 121, 133, 140, 142, 144, 145, 147, 149, 151, 157, 167, 170, 232, 236, 278, 297, 300, 301, 303, 307, 314, 317, 318, 320, 321, 322, 326, 328, 331, 335, 338, 344, 358, 367, 377, 379, 380, 381, 383, 403, 422, 424, 442, 458, 474, 475]

DropSeq HCC Random Forest
Top 104 principal components: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 23, 24, 26, 30, 31, 34, 35, 36, 37, 38, 39, 41, 42, 44, 45, 46, 47, 48, 49, 53, 54, 56, 60, 63, 65, 68, 69, 70, 72, 73, 74, 75, 76, 78, 80, 90, 99, 100, 102, 103, 106, 109, 110, 126, 127, 133, 134, 141, 142, 145, 147, 155, 156, 161, 167, 169, 170, 172, 174, 176, 182, 184, 185, 187, 189, 190, 192, 195, 196, 197, 198, 201, 202, 208, 212, 240, 250, 259, 262, 270, 283, 290, 301, 317]
Multilayer perceptron¶
Since multilayer perceptrons encode complex interactions in distributed weights, they do not expose a feature-importances attribute, which rules out recursive feature elimination. Beyond ANOVA filtering or a model pre-selector (LinearSVC or RandomForestClassifier), both of which have already been applied to the other models, there is no particularly meaningful way to select features. Likewise, there is no intrinsic way to read the top principal components off the pre-trained models.
However, the diverse feature selection already performed on the other models yields a robust, filtered feature set that is not determined by any single model.
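As a sketch of the ANOVA route mentioned above, a univariate `SelectKBest(f_classif)` filter can be placed in front of an `MLPClassifier` in a pipeline; the synthetic data, `k`, and network size below are placeholders, not the project's actual settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Placeholder data standing in for a normalized expression matrix
X, y = make_classification(n_samples=300, n_features=100, n_informative=10, random_state=0)

pipe = Pipeline([
    ("anova", SelectKBest(f_classif, k=20)),  # univariate filter, model-agnostic
    ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
])
pipe.fit(X, y)

# Indices of the features the ANOVA filter kept
kept = np.flatnonzero(pipe.named_steps["anova"].get_support())
print("Selected feature indices:", kept)
print("Training accuracy:", pipe.score(X, y))
```

Because the filter is fit inside the pipeline, cross-validation of the pipeline scores the selection step together with the classifier, avoiding selection leakage.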
Top genes¶
The intersection of the genes selected across data sets, i.e. genes with a given minimum number of occurrences, can be used to develop a more robust, generalized model that classifies efficiently. Feature selection on the models trained on the non-PCA-encoded data picks out genes with stronger predictive relationships to the label. For a given model type, a gene may not be selected on every data set, since the data sets differ. To see the bigger picture, the selected genes can be pooled and their occurrences counted, showing in how many models each gene was selected.
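The helpers used below, `count_and_sort_occurrences` and `filter_by_occurrences`, are defined earlier in the notebook. A minimal stdlib sketch of the counting logic, assuming they behave as their names and the surrounding loops suggest (the second argument toggles ascending order, and filtering is by exact count, which the accumulating loop turns into "n or more"):

```python
from collections import Counter

def count_and_sort_occurrences(feature_lists, ascending=False):
    """Count in how many lists each gene appears; sort by that count."""
    counts = Counter(g for features in feature_lists for g in set(features))
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=not ascending)

def filter_by_occurrences(occurrences, n):
    """Genes occurring exactly n times."""
    return [gene for gene, count in occurrences if count == n]

# Toy example with three per-dataset selections
lists = [["PGK1", "LDHA"], ["PGK1", "MT-CYB"], ["PGK1"]]
occ = count_and_sort_occurrences(lists, False)
print(occ)                        # PGK1 counted in all three lists
print(filter_by_occurrences(occ, 1))
```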
logit_feature_lists = [ss_mcf7_logit_features, ss_hcc_logit_features, ds_mcf7_logit_features, ds_hcc_logit_features]
top_logit_features_occurrences = count_and_sort_occurrences(logit_feature_lists, False)
top_logit_features = {}
for i in range(4, 0, -1):
top_logit_features[i] = top_logit_features.get(i + 1, []) + filter_by_occurrences(top_logit_features_occurrences, i)
for i in range(4, 0, -1):
if len(top_logit_features[i]) == 0:
continue
print(f"{len(top_logit_features[i])} gene(s) selected for logistic regression across {i}+ data set(s):")
print(top_logit_features[i])
print()
3 gene(s) selected for logistic regression across 4+ data set(s): ['MT-CO3', 'PGK1', 'MT-CYB'] 10 gene(s) selected for logistic regression across 3+ data set(s): ['MT-CO3', 'PGK1', 'MT-CYB', 'DDIT4', 'LDHA', 'MT-CO2', 'MT-ATP6', 'TMSB10', 'HMGA1', 'KRT19'] 95 gene(s) selected for logistic regression across 2+ data set(s): ['MT-CO3', 'PGK1', 'MT-CYB', 'DDIT4', 'LDHA', 'MT-CO2', 'MT-ATP6', 'TMSB10', 'HMGA1', 'KRT19', 'TRIM44', 'MT-TA', 'GPM6A', 'FUT11', 'MT-ND6', 'NDRG1', 'MT-RNR1', 'BAP1', 'PLOD2', 'IGFBP3', 'MT-ND4L', 'TUBA1B', 'HSPH1', 'CSTB', 'AURKA', 'ATXN2L', 'MT-ND1', 'GPI', 'EGLN3', 'BMPR1B', 'BNIP3', 'CAMK2N1', 'NCALD', 'CACNA1A', 'CAV1', 'DHCR7', 'PFKFB3', 'S100A11', 'C4orf3', 'S100A10', 'GOLGA4', 'GATA3', 'GAPDH', 'BTN3A2', 'BTBD9', 'KCNJ2', 'KCNJ3', 'RPSAP48', 'MT-TS1', 'MT-TQ', 'TPD52L1', 'BNIP3L', 'RPS6KA6', 'NEDD4L', 'ALDOC', 'SLC6A8', 'AMOTL2', 'PROSER1', 'FOSL2', 'ALDOA', 'NPM1P40', 'AKT1S1', 'ZC3H15', 'H2AC12', 'AKR1C2', 'AKR1C1', 'EMP2', 'HES1', 'ZNF302', 'CKS2', 'PRRG3', 'ADM', 'CLDN4', 'FAM162A', 'ZNF688', 'P4HA1', 'MAFF', 'H2AC11', 'CAMSAP2', 'YTHDF3', 'RHOD', 'ARMC6', 'KMT2D', 'SLC25A48', 'KPNA2', 'HSP90AA1', 'LGALS1', 'KRT4', 'SLC2A1', 'FEM1A', 'H19', 'FGF23', 'WDR43', 'APOOL', 'RGPD4-AS1'] 858 gene(s) selected for logistic regression across 1+ data set(s): ['MT-CO3', 'PGK1', 'MT-CYB', 'DDIT4', 'LDHA', 'MT-CO2', 'MT-ATP6', 'TMSB10', 'HMGA1', 'KRT19', 'TRIM44', 'MT-TA', 'GPM6A', 'FUT11', 'MT-ND6', 'NDRG1', 'MT-RNR1', 'BAP1', 'PLOD2', 'IGFBP3', 'MT-ND4L', 'TUBA1B', 'HSPH1', 'CSTB', 'AURKA', 'ATXN2L', 'MT-ND1', 'GPI', 'EGLN3', 'BMPR1B', 'BNIP3', 'CAMK2N1', 'NCALD', 'CACNA1A', 'CAV1', 'DHCR7', 'PFKFB3', 'S100A11', 'C4orf3', 'S100A10', 'GOLGA4', 'GATA3', 'GAPDH', 'BTN3A2', 'BTBD9', 'KCNJ2', 'KCNJ3', 'RPSAP48', 'MT-TS1', 'MT-TQ', 'TPD52L1', 'BNIP3L', 'RPS6KA6', 'NEDD4L', 'ALDOC', 'SLC6A8', 'AMOTL2', 'PROSER1', 'FOSL2', 'ALDOA', 'NPM1P40', 'AKT1S1', 'ZC3H15', 'H2AC12', 'AKR1C2', 'AKR1C1', 'EMP2', 'HES1', 'ZNF302', 'CKS2', 'PRRG3', 'ADM', 'CLDN4', 
'FAM162A', 'ZNF688', 'P4HA1', 'MAFF', 'H2AC11', 'CAMSAP2', 'YTHDF3', 'RHOD', 'ARMC6', 'KMT2D', 'SLC25A48', 'KPNA2', 'HSP90AA1', 'LGALS1', 'KRT4', 'SLC2A1', 'FEM1A', 'H19', 'FGF23', 'WDR43', 'APOOL', 'RGPD4-AS1', 'GPATCH4', 'GRK2', 'GJB3', 'H2BC4', 'GLE1', 'H2AX', 'GNAQ', 'H2AC20', 'H1-0', 'GPX2', 'GPRC5A', 'GOLGA3', 'H2AC16', 'GSE1', 'GYS1', 'GREM1', 'GAB2', 'GIN1', 'FBRS', 'FLOT2', 'FLNA', 'FGFBP1', 'FGF8', 'FDFT1', 'FBXL18', 'FBXL17', 'FBXL16', 'FASTKD5', 'GFRA1', 'FARP1', 'FAM83A', 'FAM50A', 'FAM189B', 'FAM177A1', 'FAM13B', 'FAM126B', 'FAM111B', 'FN1', 'FOS', 'FOSL1', 'FRS2', 'GDPGP1', 'GDI1', 'GDF15', 'GDAP2', 'GCAT', 'GBP1P1', 'GATAD2A', 'GABRE', 'GABPB2', 'H3C2', 'FYN', 'FYB1', 'FTL', 'FTH1', 'FSD1L', 'FSCN1', 'FRY', 'H2BC9', 'IFI27L2', 'H4C3', 'KITLG', 'KRT80', 'KRT8', 'KRT18', 'KPNA4', 'KNSTRN', 'KLLN', 'KLHL8', 'KLC2', 'KLC1', 'KIRREL1', 'H4C5', 'KIF5B', 'KIF2C', 'KIF23', 'KIF14', 'KHSRP', 'KEAP1', 'KDM5B', 'KDM3A', 'KCTD11', 'KYNU', 'LAD1', 'LAMB3', 'LCLAT1', 'LXN', 'LTBR', 'LRRFIP2', 'LPP', 'LOXL2', 'LMNB2', 'LMNA', 'LINC02541', 'LINC02511', 'LINC02367', 'LINC01902', 'LINC01304', 'LINC01291', 'LINC01133', 'LINC01116', 'LIMCH1', 'LETM1', 'LDLRAP1', 'LDHB', 'KCNQ1OT1', 'KAT7', 'JUP', 'FAM102A', 'ID3', 'HSPD1', 'HSPA8', 'HSPA5', 'HSP90B1', 'HSP90AB1', 'HRH1', 'HOXC13', 'HMGCS1', 'HMGB2', 'HLA-A', 'HIF3A', 'HEY1', 'HES4', 'HERPUD1', 'HEPACAM', 'HELQ', 'HCFC1', 'HBP1', 'IER2', 'IGFBP5', 'JUND', 'ILRUN', 'JUNB', 'JUN', 'JAKMIP3', 'IWS1', 'IVL', 'ITPK1', 'ITGA6', 'ISOC2', 'ISG15', 'ISCU', 'IRF6', 'IRF2BPL', 'IRAK1', 'INSIG1', 'INPP4B', 'INHBA', 'ING2', 'INF2', 'INCENP', 'FAM104A', 'ZWINT', 'FAH', 'BRIP1', 'C7orf50', 'C6orf62', 'C2orf49', 'C1orf53', 'C19orf53', 'C16orf91', 'C10orf55', 'BTBD7P1', 'BRPF3', 'BRMS1', 'BRAT1', 'CDC25B', 'BOLA3', 'BMS1', 'BLOC1S3', 'BLCAP', 'BIRC5', 'BICDL1', 'BHLHE40', 'BCYRN1', 'BCL3', 'BCAS3', 'C9orf78', 'CA9', 'CACHD1', 'CACNB2', 'CD47', 'CD44', 'CCNG2', 'CCNB2', 'CCNB1', 'CCM2', 'CCDC34', 'CCDC18', 'CCDC168', 'CBX3', 'CBFA2T3', 
'CAVIN3', 'CAST', 'CASP8AP2', 'CARM1', 'CAPZA1', 'CAP1', 'CANX', 'CALM2', 'CALHM2', 'CACNG4', 'BBOF1', 'BAZ2A', 'BAG3', 'APEH', 'ANKRD9', 'ANKRD52', 'ANKRD40', 'ANKRD17', 'ANKEF1', 'ANGPTL4', 'AMFR', 'ALDH1A3', 'AKR1C3', 'AKAP5', 'AK4', 'AJAP1', 'AHNAK2', 'AFF1', 'ADARB1', 'ADAP1', 'ACTB', 'ACSL4', 'ACAT2', 'ABO', 'ABL1', 'ANXA6', 'ARF3', 'B4GALT1', 'ARFGEF1', 'AXL', 'ATXN1L', 'ATRX', 'ATP5F1E', 'ATN1', 'ATF5', 'ARTN', 'ARSA', 'ARPP19', 'ARNTL2', 'ARNTL', 'ARL2', 'ARL13B', 'ARIH1', 'ARID5B', 'ARID1B', 'ARHGEF7', 'ARHGEF26', 'ARHGDIA', 'ARHGAP42', 'ARHGAP26', 'CDC20', 'CDC6', 'F3', 'DNAJA3', 'DYSF', 'DVL3', 'DUSP9', 'DUSP5', 'DTYMK', 'DTNB', 'DSP', 'DNMT3A', 'DNAJC21', 'DNAJA4', 'DNAJA1', 'CDK2AP2', 'DNAH11', 'DNAAF5', 'DLD', 'DKK1', 'DKC1', 'DHX38', 'DHX37', 'DHRS3', 'DGKZ', 'DGKD', 'EBAG9', 'ECH1', 'EFNA2', 'EFNA5', 'EXOC7', 'ETF1', 'ESRP2', 'ERO1A', 'EPPK1', 'EPHX1', 'ENTR1', 'ENOX2', 'ENO1', 'ENKD1', 'ELP3', 'ELOA', 'EIF5', 'EIF4G2', 'EIF4A2', 'EIF3J', 'EIF3A', 'EIF2B4', 'EHBP1L1', 'LYAR', 'EGLN1', 'DERA', 'DDX54', 'DDX5', 'CNR2', 'CNOT6L', 'CNNM2', 'CMIP', 'CLTB', 'CLSPN', 'CLIP2', 'CLIC1', 'CLDN7', 'CKAP2', 'CITED2', 'CIAO2A', 'CHAC2', 'CHAC1', 'CFAP97', 'CFAP251', 'CERS2', 'CEP83', 'CEP120', 'CENPB', 'CEACAM5', 'CDKN1A', 'CNOT9', 'COL6A3', 'DDX23', 'COX8A', 'DDIT3', 'DCTN1', 'DCBLD2', 'DCAKD', 'DBT', 'DBNDD1', 'DANT1', 'DAAM1', 'CYP1B1', 'CXCL1', 'CTXN1', 'CSNK2A2', 'CSK', 'CS', 'CRTC2', 'CRNDE', 'CRIP2', 'CREB1', 'CPTP', 'CPNE2', 'CPEB1', 'LY6D', 'MIF-AS1', 'MAD2L1', 'SOCS2', 'SRA1', 'SQSTM1', 'SPRY1', 'SPP1', 'SPN', 'SPG21', 'SPATS2L', 'SPAG5', 'SOX4', 'SOS1', 'SNX27', 'SREK1IP1P1', 'SNX24', 'SNX22', 'SNRNP70', 'SNORD3B-1', 'SNHG9', 'SNHG18', 'SMKR1', 'SMIM27', 'SMC6', 'SMC5', 'SRCAP', 'SRFBP1', 'MAP2K3', 'TAF13', 'TCF20', 'TCEAL9', 'TBKBP1', 'TBCA', 'TBC1D9', 'TATDN2', 'TARS1', 'TAOK3', 'TAF9B', 'TAF15', 'SYT14', 'SRM', 'SYNJ2', 'SYNE2', 'SULF2', 'STRIP1', 'STRBP', 'STMN1', 'STC2', 'STARD10', 'SSX2IP', 'SRXN1', 'SMARCB1', 'SLCO4A1', 'SLC9A3R1', 'RPLP0P2', 
'S100P', 'S100A2', 'RTL8C', 'RSRC2', 'RRS1', 'RRP1B', 'RRAS', 'RPS29', 'RPS27', 'RPS21', 'RPL41', 'SLC48A1', 'RPL39', 'RPL37A', 'RPL34', 'RPL30', 'RPL28', 'RPL27A', 'RPL23', 'RPL17', 'RPL15', 'RPL13', 'SAMD4A', 'SART1', 'SAT2', 'SCD', 'SLC39A6', 'SLC38A2', 'SLC2A6', 'SLC25A24', 'SLC20A1', 'SLC13A5', 'SLAIN2', 'SINHCAFP3', 'SIGMAR1', 'SHOX', 'SHISA5', 'SH3RF1', 'SF3B4', 'SETD3', 'SETD2', 'SET', 'SERINC5', 'SENP6', 'SEMA4B', 'SECISBP2L', 'SCYL2', 'TCF7L1', 'TCHP', 'TEDC2-AS1', 'VIT', 'YKT6', 'XBP1', 'WWC3', 'WTAPP1', 'WSB2', 'WDR77', 'VRK3', 'VPS9D1-AS1', 'VPS45', 'VMP1', 'VEGFB', 'UBB', 'VCPIP1', 'UTP3', 'UTP18', 'USP35', 'USP32', 'UQCR11', 'UQCC2', 'UPK1B', 'UIMC1', 'UGDH', 'YTHDF1', 'YWHAB', 'YWHAZ', 'ZBED2', 'ZNRF1', 'ZNF764', 'ZNF703', 'ZNF702P', 'ZNF480', 'ZNF418', 'ZNF354A', 'ZNF33B', 'ZNF326', 'ZNF318', 'ZNF263', 'ZNF202', 'ZMIZ1', 'ZHX1', 'ZFP36', 'ZFC3H1', 'ZBTB7A', 'ZBTB34', 'ZBTB20', 'ZBTB2', 'ZBED4', 'UBE2Q2', 'UBA52', 'TFF1', 'TMEM258', 'TOB1', 'TNNT1', 'TNIP2', 'TNFSF13B', 'TNFRSF12A', 'TMSB4XP4', 'TMEM80', 'TMEM70', 'TMEM64', 'TMEM259', 'TMEM256', 'TYSND1', 'TMEM238', 'TIMELESS', 'TIAM1', 'THRB', 'THBS1', 'THAP1', 'TGFB3', 'TGDS', 'TFRC', 'TFF3', 'TOLLIP', 'TPBG', 'TPI1', 'TPM1', 'TXNRD2', 'TXNIP', 'TXN', 'TWNK', 'TUBB6', 'TUBB4B', 'TUBB', 'TTL', 'TSR1', 'TSPYL1', 'TSPO', 'TSHZ2', 'TRIM52-AS1', 'TRIM37', 'TRIM29', 'TRIM16', 'TRAK2', 'TRAK1', 'TRAF3IP2', 'TPX2', 'TPM4', 'RNF25', 'RNF146', 'RNF122', 'MYO5C', 'NCKAP1', 'NCK1', 'NCDN', 'NCBP3', 'NCAM1', 'NBEAL2', 'NAXD', 'NACC1', 'NACA4P', 'NAA10', 'MYO10', 'MT-TV', 'MYH14', 'MYC', 'MXRA5', 'MXI1', 'MTND2P28', 'MTND1P23', 'MTA2', 'MT2A', 'MT1X', 'MT1E', 'NCL', 'NCLN', 'NCOA1', 'NCOA5', 'NSD1', 'NRP1', 'NRG4', 'NQO1', 'NPLOC4', 'NOP10', 'NOM1', 'NOLC1', 'NOL4L', 'NME1-NME2', 'NMD3', 'NLK', 'NINJ1', 'NFIC', 'NEUROD2', 'NEDD9', 'NEDD1', 'NEAT1', 'NDUFC1', 'NDUFB4', 'NDUFA8', 'MT-TY', 'MT-TS2', 'NUP188', 'MELTF-AS1', 'MKNK1', 'MIXL1', 'MIR663AHG', 'MIR210HG', 'MIOS-DT', 'ZRANB1', 'MIF', 'MGRN1', 'MGLL', 
'METTL26', 'MED18', 'MT-TP', 'MDM2', 'MCM4', 'MCM3AP', 'MB', 'MAZ', 'MARK3', 'MARK2', 'MARCKS', 'MAPKAPK2', 'MAP3K13', 'MLLT3', 'MLLT6', 'MMP1', 'MMP2', 'MT-TN', 'MT-TM', 'MT-TL1', 'MT-TE', 'MT-TD', 'MT-RNR2', 'MT-ND5', 'MT-ND4', 'MT-ND3', 'MT-ND2', 'MT-CO1', 'MT-ATP8', 'MSR1', 'MSMO1', 'MSMB', 'MRPL55', 'MRNIP', 'MPHOSPH9', 'MPHOSPH6', 'MPDU1', 'MNS1', 'NT5C', 'NUP93', 'RHOT2', 'PRR12', 'PSMD2', 'PSMD14', 'PSMA7', 'PSIP1', 'PRXL2C', 'PRSS23', 'PRRC2C', 'PRRC2A', 'PRR5L', 'PRR34-AS1', 'PRNP', 'POLR3GL', 'PRMT6', 'PREX1', 'PRDX1', 'PRC1', 'PPTC7', 'PPP4R2', 'PPP1R12B', 'PPM1G', 'PPIL1', 'PPIG', 'PSMD5', 'PSME4', 'PSMG1', 'PTGR1', 'RHBDD2', 'RGS10', 'RFK', 'RCC1L', 'RBSN', 'RBBP6', 'RAPGEF3', 'RAI14', 'RAD23A', 'RABEP1', 'RAB5C', 'RAB3GAP1', 'RAB35', 'RAB30', 'RAB2B', 'RAB27A', 'RAB1B', 'RAB12', 'RAB11FIP4', 'PYGO2', 'PTP4A2', 'PPIF', 'POLR3A', 'NUPR2', 'PCDH1', 'PFDN4', 'PERP', 'PDS5A', 'PDLIM1', 'PDCD4', 'PDAP1', 'PCYT1A', 'PCNA', 'PCDHGA10', 'PCDHB1', 'PATL1', 'POLR2A', 'PARD6B', 'PAQR8', 'PAQR7', 'PAPOLA', 'PAK2', 'PACS1', 'P4HA2', 'OVOL1', 'OTUD7B', 'OPTN', 'PGAM1', 'PGAM5', 'PHACTR1', 'PHF20L1', 'POLE4', 'POLDIP2', 'POLB', 'PMEPA1', 'PLK2', 'PLIN2', 'PLEC', 'PLD1', 'PLCE1', 'PLCD3', 'PLCB4', 'PLBD2', 'PLAU', 'PKM', 'PKIB', 'PITX1', 'PITPNA', 'PICALM', 'PI4KB', 'PHRF1', 'PHLDA2', 'AAMP']
svm_feature_lists = [ss_mcf7_svm_features, ss_hcc_svm_features, ds_mcf7_svm_features, ds_hcc_svm_features]
top_svm_features_occurrences = count_and_sort_occurrences(svm_feature_lists, False)
top_svm_features = {}
for i in range(4, 0, -1):
top_svm_features[i] = top_svm_features.get(i + 1, []) + filter_by_occurrences(top_svm_features_occurrences, i)
for i in range(4, 0, -1):
if len(top_svm_features[i]) == 0:
continue
print(f"{len(top_svm_features[i])} gene(s) selected for SVM across {i}+ data set(s):")
print(top_svm_features[i])
print()
2 gene(s) selected for SVM across 4+ data set(s): ['PGK1', 'MT-CYB'] 7 gene(s) selected for SVM across 3+ data set(s): ['PGK1', 'MT-CYB', 'LDHA', 'TMSB10', 'DDIT4', 'MT-CO3', 'MT-CO2'] 70 gene(s) selected for SVM across 2+ data set(s): ['PGK1', 'MT-CYB', 'LDHA', 'TMSB10', 'DDIT4', 'MT-CO3', 'MT-CO2', 'S100A10', 'RPSAP48', 'TPD52L1', 'RPS6KA6', 'BAP1', 'ATXN2L', 'NEDD4L', 'TRIM44', 'BMPR1B', 'MT-TS1', 'MT-TQ', 'KRT19', 'TWNK', 'FGF23', 'FEM1A', 'MT-TA', 'KMT2D', 'TMEM64', 'KCNJ3', 'KCNJ2', 'HMGA1', 'NPM1P40', 'H2AC12', 'H2AC11', 'H19', 'STC2', 'CAV1', 'GPM6A', 'SLC25A48', 'GOLGA4', 'CAMSAP2', 'CAMK2N1', 'CACNA1A', 'BTN3A2', 'BTBD9', 'GATA3', 'GAPDH', 'NCALD', 'S100A11', 'ARMC6', 'HEPACAM', 'DSP', 'MAFF', 'RHOD', 'ZBTB20', 'YTHDF3', 'ZNF302', 'FAM162A', 'ZC3H15', 'MT-ATP6', 'AKT1S1', 'ZNF688', 'MT-ND6', 'AKR1C2', 'PRRG3', 'RGPD4-AS1', 'APOOL', 'WDR43', 'EMP2', 'MT-ND1', 'PROSER1', 'LGALS1', 'MT-ND4L'] 757 gene(s) selected for SVM across 1+ data set(s): ['PGK1', 'MT-CYB', 'LDHA', 'TMSB10', 'DDIT4', 'MT-CO3', 'MT-CO2', 'S100A10', 'RPSAP48', 'TPD52L1', 'RPS6KA6', 'BAP1', 'ATXN2L', 'NEDD4L', 'TRIM44', 'BMPR1B', 'MT-TS1', 'MT-TQ', 'KRT19', 'TWNK', 'FGF23', 'FEM1A', 'MT-TA', 'KMT2D', 'TMEM64', 'KCNJ3', 'KCNJ2', 'HMGA1', 'NPM1P40', 'H2AC12', 'H2AC11', 'H19', 'STC2', 'CAV1', 'GPM6A', 'SLC25A48', 'GOLGA4', 'CAMSAP2', 'CAMK2N1', 'CACNA1A', 'BTN3A2', 'BTBD9', 'GATA3', 'GAPDH', 'NCALD', 'S100A11', 'ARMC6', 'HEPACAM', 'DSP', 'MAFF', 'RHOD', 'ZBTB20', 'YTHDF3', 'ZNF302', 'FAM162A', 'ZC3H15', 'MT-ATP6', 'AKT1S1', 'ZNF688', 'MT-ND6', 'AKR1C2', 'PRRG3', 'RGPD4-AS1', 'APOOL', 'WDR43', 'EMP2', 'MT-ND1', 'PROSER1', 'LGALS1', 'MT-ND4L', 'GOLGA3', 'FAM111B', 'GNAQ', 'FAM104A', 'GPATCH4', 'GPI', 'GLE1', 'FAM126B', 'FBXL18', 'GREM1', 'FAM102A', 'FAH', 'H4C5', 'H3C2', 'H2BC9', 'H2BC4', 'H2AC20', 'H2AC16', 'EPHX1', 'EPPK1', 'ESRP2', 'GYS1', 'GSE1', 'ETF1', 'GRK2', 'GIN1', 'EXOC7', 'GJB3', 'FAM13B', 'GDPGP1', 'FRY', 'FBXL17', 'FGD5-AS1', 'FGF8', 'FBXL16', 'FLNA', 'FLOT2', 'FOS', 'FOSL1', 
'HCFC1', 'FRS2', 'FBRS', 'FASTKD5', 'FARP1', 'FSD1L', 'GDI1', 'FUT11', 'FYN', 'FAM50A', 'FAM189B', 'FAM177A1', 'GAB2', 'GABPB2', 'GABRE', 'GATAD2A', 'GBP1P1', 'GCAT', 'GDAP2', 'GDF15', 'FOSL2', 'ZRANB1', 'HELQ', 'KRT80', 'LINC01291', 'LINC01133', 'LINC01116', 'LIMCH1', 'LETM1', 'LDLRAP1', 'LCLAT1', 'LAD1', 'KRT4', 'HES1', 'KPNA4', 'KPNA2', 'KLLN', 'KLHL8', 'KLC2', 'KLC1', 'KITLG', 'KIRREL1', 'LINC01304', 'LINC01902', 'LINC02367', 'LINC02511', 'MCM3AP', 'MB', 'MAZ', 'MARK3', 'MARK2', 'MARCKS', 'MAPKAPK2', 'MAP3K13', 'MAP2K3', 'MAD2L1', 'LYAR', 'LXN', 'LTBR', 'LRRFIP2', 'LPP', 'LMNB2', 'LINC02541', 'KIF5B', 'KIF14', 'KHSRP', 'IGFBP5', 'ENOX2', 'IFITM3', 'IFI27L2', 'HSPH1', 'HSPD1', 'HSPA8', 'HSPA5', 'HSP90AB1', 'HSP90AA1', 'HPCAL1', 'HOXC13', 'HNRNPA2B1', 'HMGB2', 'HILPDA', 'HIF3A', 'HEY1', 'HES4', 'IGFBP3', 'ILRUN', 'KEAP1', 'INCENP', 'KDM3A', 'KCNQ1OT1', 'KAT7', 'JUND', 'JUN', 'JAKMIP3', 'IWS1', 'IVL', 'ITPK1', 'ISOC2', 'ISCU', 'IRF2BPL', 'IRAK1', 'INPP4B', 'INHBA', 'ING2', 'INF2', 'ENTR1', 'DYNC2I2', 'ENO1', 'BRIP1', 'BOLA3', 'BNIP3L', 'BNIP3', 'BMS1', 'BLOC1S3', 'BICDL1', 'BEND7', 'BCYRN1', 'BCL3', 'BCAS3', 'BBOF1', 'BAZ2A', 'B4GALT1', 'AXL', 'AURKA', 'ATXN1L', 'ATRX', 'ATP9A', 'ATP5F1E', 'BRAT1', 'BRMS1', 'ENKD1', 'BRPF3', 'CAST', 'CASP8AP2', 'CARM1', 'CAPZA1', 'CAP1', 'CALM2', 'CALHM2', 'CACNG4', 'CACNB2', 'CACHD1', 'C9orf78', 'C7orf50', 'C6orf62', 'C4orf3', 'C2orf49', 'C1orf53', 'C19orf53', 'C16orf91', 'BTBD7P1', 'ATN1', 'ATF5', 'ARTN', 'ARSA', 'ANKEF1', 'ANGPTL4', 'AMOTL2', 'AMFR', 'ALDOC', 'ALDOA', 'AKR1C3', 'AKR1C1', 'AKAP5', 'AK4', 'AJAP1', 'AHNAK2', 'AFF1', 'ADM', 'ADARB1', 'ACTB', 'ACSL4', 'ABO', 'ABL1', 'ANKRD17', 'ANKRD40', 'ANKRD52', 'ARID1B', 'ARPP19', 'ARPC1B', 'ARNTL2', 'ARNTL', 'ARL2', 'ARL13B', 'ARIH1', 'ARID5B', 'ARHGEF7', 'ANKRD9', 'ARHGEF26', 'ARHGDIA', 'ARHGAP42', 'ARHGAP26', 'ARFGEF1', 'ARF3', 'APEH', 'ANXA6', 'CAVIN3', 'CBFA2T3', 'CBX3', 'DAAM1', 'DNAJA3', 'DNAJA1', 'DNAH11', 'DNAAF5', 'DLD', 'DKK1', 'DKC1', 'DHX38', 'DHX37', 'DGKZ', 
'DGKD', 'DERA', 'DDX54', 'DDX23', 'DDIT3', 'DCTN1', 'DCAKD', 'DBT', 'DBNDD1', 'DNAJA4', 'DNAJC21', 'DNMT3A', 'EFNA5', 'ELP3', 'ELOA', 'EIF4G2', 'EIF3J', 'EIF3A', 'EIF2B4', 'EHBP1L1', 'EGLN3', 'EFNA2', 'DTNB', 'ECH1', 'EBAG9', 'DYSF', 'MCM7', 'DVL3', 'DUSP9', 'DUSP5', 'DTYMK', 'DANT1', 'CTXN1', 'CCDC168', 'CSTB', 'CLIC1', 'CLDN4', 'CKS2', 'CITED2', 'CIAO2A', 'CHAC2', 'CFAP97', 'CFAP251', 'CERS2', 'CEP83', 'CEP120', 'CENPB', 'CDC20', 'CD9', 'CD47', 'CD44', 'CCNG2', 'CCDC34', 'CCDC18', 'CLIP2', 'CLSPN', 'CLTB', 'CPTP', 'CSNK2A2', 'CSK', 'CS', 'CRTC2', 'CRNDE', 'CRIP2', 'CREB1', 'CRABP2', 'CPNE2', 'CMIP', 'CPEB4', 'CPEB1', 'COX8A', 'COL6A3', 'CNR2', 'CNOT9', 'CNOT6L', 'CNNM2', 'MCM4', 'MPHOSPH6', 'MDM2', 'SPG21', 'SOX4', 'SOS1', 'SOCS2', 'SNX27', 'SNX24', 'SNX22', 'SNRNP70', 'SNORD3B-1', 'SNHG9', 'SNHG18', 'SMKR1', 'SMIM27', 'SMC6', 'SMC5', 'SMARCB1', 'SLC6A8', 'SLC48A1', 'SLC2A6', 'SLC2A1', 'SPATS2L', 'SPN', 'MED18', 'SPRY1', 'TAOK3', 'TAF9B', 'TAF15', 'TAF13', 'SYTL2', 'SYT14', 'SYNJ2', 'SYNE2', 'SULF2', 'STRIP1', 'STRBP', 'STMN1', 'STARD10', 'SSX2IP', 'SRSF8', 'SRFBP1', 'SREK1IP1P1', 'SRCAP', 'SRA1', 'SLC25A24', 'SLC13A5', 'SLAIN2', 'SINHCAFP3', 'RPS29', 'RPS27', 'RPS21', 'RPLP0P2', 'RPL41', 'RPL39', 'RPL37A', 'RPL34', 'RPL30', 'RPL28', 'RPL27A', 'RPL23', 'RPL17', 'RPL15', 'RPL13', 'RNF25', 'RNF146', 'RNF122', 'RHOT2', 'RRAS', 'RRP1B', 'RRS1', 'SERINC5', 'SIGMAR1', 'SHOX', 'SHISA5', 'SH3RF1', 'SF3B4', 'SETD3', 'SETD2', 'SET', 'SENP6', 'RSRC2', 'SENP3', 'SECISBP2L', 'SCYL2', 'SAT2', 'SART1', 'SAMD4A', 'S100P', 'RTL8C', 'TARS1', 'TATDN2', 'TBC1D9', 'UBA52', 'YTHDF1', 'YKT6', 'XBP1', 'WWC3', 'WSB2', 'WDR77', 'VRK3', 'VPS9D1-AS1', 'VPS45', 'VMP1', 'VIT', 'VEGFB', 'UTP3', 'UTP18', 'USP35', 'USP32', 'UQCR11', 'UQCC2', 'UIMC1', 'YWHAB', 'YWHAZ', 'ZBED2', 'ZNF318', 'ZNF764', 'ZNF703', 'ZNF702P', 'ZNF480', 'ZNF418', 'ZNF354A', 'ZNF33B', 'ZNF326', 'ZNF263', 'ZBED4', 'ZNF202', 'ZMIZ1', 'ZHX1', 'ZFP36', 'ZFC3H1', 'ZBTB7A', 'ZBTB34', 'ZBTB2', 'UBE2Q2', 'TXNRD2', 'TBCA', 'TXN', 
'TMEM70', 'TMEM259', 'TMEM258', 'TMEM256', 'TMEM238', 'TIMELESS', 'TIAM1', 'THRB', 'THAP1', 'TGFB3', 'TGDS', 'TFF3', 'TFF1', 'TEDC2-AS1', 'TCHP', 'TCF7L1', 'TCF20', 'TCEAL9', 'TBKBP1', 'TMEM80', 'TMSB4XP4', 'TNFRSF12A', 'TRAK2', 'TUBB6', 'TUBA1B', 'TTL', 'TSR1', 'TSPYL1', 'TSHZ2', 'TRIM52-AS1', 'TRIM37', 'TRAK1', 'TNFSF13B', 'TRAF3IP2', 'TPM4', 'TPM1', 'TPI1', 'TOLLIP', 'TOB1', 'TNNT1', 'TNIP2', 'RHBDD2', 'RGS10', 'RFK', 'MTND2P28', 'NCOA5', 'NCOA1', 'NCLN', 'NCL', 'NCKAP1', 'NCK1', 'NCDN', 'NCBP3', 'NCAM1', 'NBEAL2', 'NAXD', 'NACC1', 'NACA4P', 'NAA10', 'MYO5C', 'MYO10', 'MYH14', 'MXRA5', 'MXI1', 'NDRG1', 'NDUFA8', 'NDUFB4', 'NOL4L', 'NUP93', 'NSD1', 'NRG4', 'NR4A1', 'NQO1', 'NPLOC4', 'NOP10', 'NOM1', 'NME1-NME2', 'NDUFC1', 'NMD3', 'NLK', 'NINJ1', 'NFIC', 'NEUROD2', 'NEDD9', 'NEDD1', 'NEAT1', 'MUL1', 'MTND1P23', 'OPTN', 'MTA2', 'MRPS2', 'MRPL55', 'MPHOSPH9', 'ZNRF1', 'MPDU1', 'MNS1', 'MMP2', 'MMP1', 'MLLT6', 'MLLT3', 'MKNK1', 'MIXL1', 'MIR663AHG', 'MIR210HG', 'MIOS-DT', 'MGRN1', 'MGLL', 'METTL26', 'MELTF-AS1', 'MSMB', 'MSR1', 'MT-ATP8', 'MT-TM', 'MT2A', 'MT1X', 'MT1E', 'MT-TY', 'MT-TV', 'MT-TS2', 'MT-TP', 'MT-TN', 'MT-TL1', 'MT-CO1', 'MT-TE', 'MT-TD', 'MT-RNR2', 'MT-RNR1', 'MT-ND5', 'MT-ND4', 'MT-ND3', 'MT-ND2', 'NUPR2', 'OTUD7B', 'RCC1L', 'PPIG', 'PSMD5', 'PSMD2', 'PSMD14', 'PSMA7', 'PSIP1', 'PSAP', 'PRXL2C', 'PRRC2C', 'PRRC2A', 'PRR5L', 'PRR34-AS1', 'PRR12', 'PRMT6', 'PREX1', 'PRDX1', 'PPTC7', 'PPP4R2', 'PPP1R12B', 'PPM1G', 'PSME4', 'PSMG1', 'PTGR1', 'RAB35', 'RBSN', 'RBBP6', 'RAPGEF3', 'RAI14', 'RAD23A', 'RABEP1', 'RAB5C', 'RAB3GAP1', 'RAB30', 'PTP4A2', 'RAB2B', 'RAB27A', 'RAB1B', 'RAB12', 'RAB11FIP4', 'QSOX1', 'PYGO2', 'PUSL1', 'PPIL1', 'POLR3GL', 'OVOL1', 'POLE4', 'PGAM5', 'PGAM1', 'PFKFB3', 'PFDN4', 'PDS5A', 'PDLIM1', 'PDCD4', 'PDAP1', 'PCYT1A', 'PCDHGA10', 'PCDHB1', 'PATL1', 'PARD6B', 'PAQR8', 'PAQR7', 'PAPOLA', 'PAK2', 'PACS1', 'P4HA1', 'PHACTR1', 'PHC1', 'PHF20L1', 'PLCB4', 'POLDIP2', 'POLB', 'PMEPA1', 'PLOD2', 'PLEC', 'PLD1', 'PLCE1', 'PLCD3', 'PLBD2', 
'PHLDA2', 'PLAU', 'PKM', 'PKIB', 'PITX1', 'PITPNA', 'PICALM', 'PI4KB', 'PHRF1', 'AAMP']
random_forest_feature_lists = [ss_mcf7_random_forest_features, ss_hcc_random_forest_features, ds_mcf7_random_forest_features, ds_hcc_random_forest_features]
top_random_forest_features_occurrences = count_and_sort_occurrences(random_forest_feature_lists, False)
top_random_forest_features = {}
for i in range(4, 0, -1):
top_random_forest_features[i] = top_random_forest_features.get(i + 1, []) + filter_by_occurrences(top_random_forest_features_occurrences, i)
for i in range(4, 0, -1):
if len(top_random_forest_features[i]) == 0:
continue
print(f"{len(top_random_forest_features[i])} gene(s) selected for random forest across {i}+ data set(s):")
print(top_random_forest_features[i])
print()
1 gene(s) selected for random forest across 4+ data set(s): ['PGK1'] 8 gene(s) selected for random forest across 3+ data set(s): ['PGK1', 'NDRG1', 'BNIP3', 'DSP', 'P4HA1', 'BNIP3L', 'KRT19', 'GAPDH'] 51 gene(s) selected for random forest across 2+ data set(s): ['PGK1', 'NDRG1', 'BNIP3', 'DSP', 'P4HA1', 'BNIP3L', 'KRT19', 'GAPDH', 'RPS28', 'RPS27', 'RPS19', 'FUT11', 'RPS5', 'S100A10', 'S100A11', 'RPS14', 'GPI', 'RPLP2', 'RPLP1', 'RPL37A', 'RPL39', 'H4C3', 'RPL36', 'PFKFB3', 'RPL35', 'DSCAM-AS1', 'PKM', 'FGF23', 'RPL13', 'RPL12', 'EGLN3', 'ELOB', 'ENO1', 'ENO2', 'ERO1A', 'SERF2', 'FAM162A', 'MT-CO3', 'MT-CYB', 'TPI1', 'HES1', 'TMSB10', 'LDHA', 'MT-ATP6', 'BCYRN1', 'MT-CO2', 'ALDOA', 'MALAT1', 'ADM', 'UQCRQ', 'MT-RNR2'] 242 gene(s) selected for random forest across 1+ data set(s): ['PGK1', 'NDRG1', 'BNIP3', 'DSP', 'P4HA1', 'BNIP3L', 'KRT19', 'GAPDH', 'RPS28', 'RPS27', 'RPS19', 'FUT11', 'RPS5', 'S100A10', 'S100A11', 'RPS14', 'GPI', 'RPLP2', 'RPLP1', 'RPL37A', 'RPL39', 'H4C3', 'RPL36', 'PFKFB3', 'RPL35', 'DSCAM-AS1', 'PKM', 'FGF23', 'RPL13', 'RPL12', 'EGLN3', 'ELOB', 'ENO1', 'ENO2', 'ERO1A', 'SERF2', 'FAM162A', 'MT-CO3', 'MT-CYB', 'TPI1', 'HES1', 'TMSB10', 'LDHA', 'MT-ATP6', 'BCYRN1', 'MT-CO2', 'ALDOA', 'MALAT1', 'ADM', 'UQCRQ', 'MT-RNR2', 'FDPS', 'INSIG1', 'HSPB1', 'FOSL1', 'IGFBP3', 'HSPD1', 'FAM83A', 'IFITM3', 'HSPH1', 'FASN', 'FDFT1', 'HSPA5', 'IFITM2', 'H2AC12', 'HSP90AB1', 'FOSL2', 'HSP90B1', 'H1-3', 'HILPDA', 'FAM13A', 'H1-1', 'HK2', 'GYS1', 'HMGA1', 'HNRNPA2B1', 'GSTP1', 'GPRC5A', 'GPM6A', 'HNRNPM', 'HNRNPU', 'HSP90AA1', 'FTL', 'ZNF473', 'DYNC2I2', 'EZR', 'EMP2', 'C19orf53', 'BUB1B', 'BUB1', 'BTBD9', 'BLCAP', 'BHLHE40', 'BAP1', 'B4GALT1', 'ATP5MK', 'ATP5MG', 'ATP5ME', 'ATP5F1E', 'ATAD2', 'ASB2', 'ARRDC3', 'APEH', 'ANGPTL4', 'ALDOC', 'AKR1C2', 'AKR1C1', 'AHNAK2', 'ACTB', 'ACLY', 'C1orf116', 'C4orf3', 'CA9', 'CYP1B1', 'EIF5', 'EIF3J', 'EGLN1', 'EEF2', 'EEF1A1', 'EBP', 'KCNJ3', 'DNMT1', 'DDIT4', 'CYP1B1-AS1', 'CYB561A3', 'CACNA1A', 'COX7C', 'COX7A2', 'CNNM2', 
'CENPF', 'CDKN1A', 'CBX3', 'CAV1', 'CAST', 'CALM2', 'CALB1', 'IRF2BP2', 'MOV10', 'KCTD11', 'KDM3A', 'SNRPD2', 'SNRNP25', 'SLC9A3R1', 'SLC6A8', 'SLC3A2', 'SLC2A1', 'SET', 'S100A6', 'RPSA', 'RPS8', 'RPS3', 'RPS2', 'RPS16', 'RPS15A', 'RPS15', 'RPS12', 'RPL8', 'RPL41', 'RPL37', 'RPL35A', 'RPL34', 'RPL30', 'RPL28', 'SNX33', 'SOX4', 'SQLE', 'TRIM44', 'ZC3H15', 'YWHAZ', 'WDR43', 'VEGFA', 'UPK1B', 'UBC', 'UBA52', 'TUBG1', 'TUBD1', 'TST', 'TPX2', 'SRM', 'TPT1', 'TPBG', 'TOB1', 'TMSB4X', 'TMEM64', 'TMEM45A', 'TMEM258', 'TFF3', 'TFF1', 'STMN1', 'RPL27A', 'RPL23', 'RPL21', 'MT-CO1', 'NEAT1', 'NDUFB2', 'NCL', 'NCALD', 'MTATP6P1', 'MT2A', 'MT-TQ', 'MT-RNR1', 'MT-ND4', 'MT-ND3', 'ZNF302', 'NPM1P40', 'MOB3A', 'MIF', 'MARCKS', 'LOXL2', 'LGALS1', 'LDHB', 'LBH', 'KYNU', 'KRT8', 'KRT18', 'NECTIN2', 'P4HA2', 'RPL15', 'PRRG3', 'RPL11', 'RPL10', 'ROMO1', 'RALGDS', 'RAC1', 'PYCR3', 'PTMS', 'PSME2', 'PSMA7', 'PRSS8', 'PPP1R3G', 'PABPC1', 'POLR2L', 'PLOD2', 'PLIN2', 'PLEC', 'PLAC8', 'PFKP', 'PFKFB4', 'PDK1', 'PDIA3', 'PARD6B', 'ACAT2']
Looking at the selected genes across all models and data sets, certain genes with consistently high predictive power emerge. Genes with higher occurrence counts may be especially useful for constructing a generalized model.
feature_lists = logit_feature_lists + svm_feature_lists + random_forest_feature_lists
top_genes_occurrences = count_and_sort_occurrences(feature_lists, False)
top_genes = {}
for i in range(12, 0, -1):
top_genes[i] = top_genes.get(i + 1, []) + filter_by_occurrences(top_genes_occurrences, i)
for i in range(12, 0, -1):
if len(top_genes[i]) == 0:
continue
print(f"{len(top_genes[i])} gene(s) selected {i}+ times:")
print(top_genes[i])
print()
1 gene(s) selected 12+ times: ['PGK1'] 1 gene(s) selected 11+ times: ['PGK1'] 2 gene(s) selected 10+ times: ['PGK1', 'MT-CYB'] 3 gene(s) selected 9+ times: ['PGK1', 'MT-CYB', 'MT-CO3'] 7 gene(s) selected 8+ times: ['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA'] 10 gene(s) selected 7+ times: ['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4'] 20 gene(s) selected 6+ times: ['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A'] 45 gene(s) selected 5+ times: ['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3'] 97 gene(s) selected 4+ times: ['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3', 'CAMK2N1', 'BCYRN1', 'GATA3', 'RPSAP48', 'GOLGA4', 'PROSER1', 'BMPR1B', 'FOSL2', 'CAMSAP2', 'RPS6KA6', 'FEM1A', 'RGPD4-AS1', 'RHOD', 'RPL13', 'TPI1', 'TPD52L1', 'BTN3A2', 'RPS27', 'ENO1', 'C4orf3', 'TMEM64', 'H19', 'HSPH1', 'H2AC11', 'NEDD4L', 'MT-ND1', 'ZNF688', 'MT-ND4L', 'MT-ND6', 'MT-RNR1', 'MT-RNR2', 'MT-TA', 'AKR1C1', 'AKT1S1', 'ALDOC', 'MT-TS1', 'ATXN2L', 'YTHDF3', 'MAFF', 'APOOL', 'KCNJ2', 'PLOD2', 'HSP90AA1', 'RPL39', 'ARMC6', 'SLC6A8', 'IGFBP3', 'PKM', 'SLC25A48', 
'KMT2D', 'SLC2A1', 'RPL37A'] 155 gene(s) selected 3+ times: ['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3', 'CAMK2N1', 'BCYRN1', 'GATA3', 'RPSAP48', 'GOLGA4', 'PROSER1', 'BMPR1B', 'FOSL2', 'CAMSAP2', 'RPS6KA6', 'FEM1A', 'RGPD4-AS1', 'RHOD', 'RPL13', 'TPI1', 'TPD52L1', 'BTN3A2', 'RPS27', 'ENO1', 'C4orf3', 'TMEM64', 'H19', 'HSPH1', 'H2AC11', 'NEDD4L', 'MT-ND1', 'ZNF688', 'MT-ND4L', 'MT-ND6', 'MT-RNR1', 'MT-RNR2', 'MT-TA', 'AKR1C1', 'AKT1S1', 'ALDOC', 'MT-TS1', 'ATXN2L', 'YTHDF3', 'MAFF', 'APOOL', 'KCNJ2', 'PLOD2', 'HSP90AA1', 'RPL39', 'ARMC6', 'SLC6A8', 'IGFBP3', 'PKM', 'SLC25A48', 'KMT2D', 'SLC2A1', 'RPL37A', 'CNNM2', 'SOX4', 'SET', 'CSTB', 'CLDN4', 'CKS2', 'RPL41', 'EIF3J', 'ERO1A', 'RPL34', 'HSPA5', 'MT-ND3', 'MT-ND4', 'MT2A', 'MARCKS', 'NCL', 'NEAT1', 'KRT4', 'KPNA2', 'PARD6B', 'KDM3A', 'HSPD1', 'PLEC', 'RPL30', 'HSP90AB1', 'HEPACAM', 'H4C3', 'STMN1', 'GYS1', 'PSMA7', 'FOSL1', 'RPL15', 'RPL23', 'RPL27A', 'RPL28', 'STC2', 'MT-CO1', 'ANGPTL4', 'TMEM258', 'YWHAZ', 'UBA52', 'C19orf53', 'ATP5F1E', 'APEH', 'TOB1', 'TFF3', 'AMOTL2', 'CBX3', 'ZBTB20', 'AHNAK2', 'AURKA', 'TWNK', 'ACTB', 'B4GALT1', 'CALM2', 'TFF1', 'CAST', 'TUBA1B'] 784 gene(s) selected 2+ times: ['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3', 'CAMK2N1', 'BCYRN1', 'GATA3', 
'RPSAP48', 'GOLGA4', 'PROSER1', 'BMPR1B', 'FOSL2', 'CAMSAP2', 'RPS6KA6', 'FEM1A', 'RGPD4-AS1', 'RHOD', 'RPL13', 'TPI1', 'TPD52L1', 'BTN3A2', 'RPS27', 'ENO1', 'C4orf3', 'TMEM64', 'H19', 'HSPH1', 'H2AC11', 'NEDD4L', 'MT-ND1', 'ZNF688', 'MT-ND4L', 'MT-ND6', 'MT-RNR1', 'MT-RNR2', 'MT-TA', 'AKR1C1', 'AKT1S1', 'ALDOC', 'MT-TS1', 'ATXN2L', 'YTHDF3', 'MAFF', 'APOOL', 'KCNJ2', 'PLOD2', 'HSP90AA1', 'RPL39', 'ARMC6', 'SLC6A8', 'IGFBP3', 'PKM', 'SLC25A48', 'KMT2D', 'SLC2A1', 'RPL37A', 'CNNM2', 'SOX4', 'SET', 'CSTB', 'CLDN4', 'CKS2', 'RPL41', 'EIF3J', 'ERO1A', 'RPL34', 'HSPA5', 'MT-ND3', 'MT-ND4', 'MT2A', 'MARCKS', 'NCL', 'NEAT1', 'KRT4', 'KPNA2', 'PARD6B', 'KDM3A', 'HSPD1', 'PLEC', 'RPL30', 'HSP90AB1', 'HEPACAM', 'H4C3', 'STMN1', 'GYS1', 'PSMA7', 'FOSL1', 'RPL15', 'RPL23', 'RPL27A', 'RPL28', 'STC2', 'MT-CO1', 'ANGPTL4', 'TMEM258', 'YWHAZ', 'UBA52', 'C19orf53', 'ATP5F1E', 'APEH', 'TOB1', 'TFF3', 'AMOTL2', 'CBX3', 'ZBTB20', 'AHNAK2', 'AURKA', 'TWNK', 'ACTB', 'B4GALT1', 'CALM2', 'TFF1', 'CAST', 'TUBA1B', 'IVL', 'IWS1', 'ARIH1', 'IRF2BPL', 'ITPK1', 'ISOC2', 'ISCU', 'IRAK1', 'INSIG1', 'INPP4B', 'JAKMIP3', 'ARID1B', 'JUN', 'KIF5B', 'KLLN', 'KLHL8', 'KLC2', 'KLC1', 'KITLG', 'KIRREL1', 'ARHGEF26', 'KIF14', 'JUND', 'KHSRP', 'KEAP1', 'ARHGEF7', 'KCTD11', 'KCNQ1OT1', 'ARID5B', 'KAT7', 'INHBA', 'CDKN1A', 'ING2', 'HMGB2', 'HIF3A', 'HEY1', 'HES4', 'ATRX', 'HELQ', 'HCFC1', 'H4C5', 'ATXN1L', 'H3C2', 'H2BC9', 'H2BC4', 'H2AC20', 'H2AC16', 'AXL', 'BAZ2A', 'GSE1', 'GRK2', 'HILPDA', 'HNRNPA2B1', 'INF2', 'HOXC13', 'INCENP', 'ILRUN', 'IGFBP5', 'ARHGAP42', 'ARL13B', 'ARL2', 'IFITM3', 'IFI27L2', 'ARNTL', 'ARNTL2', 'HSPA8', 'ARPP19', 'HSP90B1', 'ARSA', 'ARTN', 'ATF5', 'ATN1', 'ARHGDIA', 'ARF3', 'ARHGAP26', 'MARK3', 'MLLT6', 'MLLT3', 'MKNK1', 'MIXL1', 'MIR663AHG', 'MIR210HG', 'MIOS-DT', 'MIF', 'MGRN1', 'MGLL', 'METTL26', 'MELTF-AS1', 'MED18', 'MDM2', 'MCM4', 'MCM3AP', 'MB', 'MMP1', 'AMFR', 'AKR1C3', 'MSMB', 'ABL1', 'ABO', 'ACAT2', 'ACSL4', 'MSR1', 'ADARB1', 'AFF1', 'AJAP1', 'MMP2', 'AK4', 'AKAP5', 
'MRPL55', 'MPHOSPH9', 'MPHOSPH6', 'MPDU1', 'ZRANB1', 'MAZ', 'MARK2', 'KPNA4', 'MAPKAPK2', 'LINC01291', 'LINC01133', 'LINC01116', 'LIMCH1', 'ANKRD9', 'LETM1', 'LDLRAP1', 'ANXA6', 'LDHB', 'LCLAT1', 'LAD1', 'KYNU', 'KRT80', 'KRT8', 'BBOF1', 'ARFGEF1', 'KRT18', 'LINC01304', 'LINC01902', 'LINC02367', 'LXN', 'MAP3K13', 'MAP2K3', 'MALAT1', 'ANKEF1', 'MAD2L1', 'LYAR', 'ANKRD17', 'LTBR', 'LINC02511', 'LRRFIP2', 'LPP', 'ANKRD40', 'LOXL2', 'LMNB2', 'ANKRD52', 'LINC02541', 'GREM1', 'BCL3', 'GPRC5A', 'DNAJA3', 'DNAH11', 'DNAAF5', 'DLD', 'DKK1', 'DKC1', 'DHX38', 'DHX37', 'CAVIN3', 'DHCR7', 'DGKZ', 'DGKD', 'DERA', 'DDX54', 'CBFA2T3', 'DDX23', 'DDIT3', 'DCTN1', 'DNAJA1', 'DNAJA4', 'DBT', 'DNAJC21', 'CAPZA1', 'CARM1', 'CASP8AP2', 'EGLN1', 'EFNA5', 'EFNA2', 'ECH1', 'EBAG9', 'DYSF', 'DYNC2I2', 'DVL3', 'DUSP9', 'DUSP5', 'DTYMK', 'DTNB', 'DSCAM-AS1', 'DNMT3A', 'DCAKD', 'DBNDD1', 'EIF2B4', 'CNOT6L', 'CMIP', 'CLTB', 'CLSPN', 'CLIP2', 'CLIC1', 'CCNG2', 'CD44', 'CITED2', 'CD47', 'CDC20', 'CIAO2A', 'CHAC2', 'CFAP97', 'CFAP251', 'CERS2', 'CEP83', 'CEP120', 'CCDC34', 'CNOT9', 'DANT1', 'CNR2', 'DAAM1', 'CYP1B1', 'CTXN1', 'CCDC168', 'CSNK2A2', 'CSK', 'CS', 'CRTC2', 'CRNDE', 'CRIP2', 'CREB1', 'CCDC18', 'CPTP', 'CPNE2', 'CPEB1', 'COX8A', 'COL6A3', 'EHBP1L1', 'EIF3A', 'BCAS3', 'GAB2', 'BLOC1S3', 'FTL', 'FSD1L', 'FRY', 'FRS2', 'BMS1', 'BOLA3', 'FOS', 'FLOT2', 'FLNA', 'FGF8', 'BRAT1', 'BRIP1', 'FDFT1', 'BRMS1', 'FBXL18', 'FBXL17', 'FYN', 'GABPB2', 'FBXL16', 'GABRE', 'GPATCH4', 'CENPB', 'BHLHE40', 'GOLGA3', 'GNAQ', 'GLE1', 'GJB3', 'GIN1', 'GDPGP1', 'GDI1', 'GDF15', 'GDAP2', 'GCAT', 'GBP1P1', 'GATAD2A', 'BICDL1', 'BLCAP', 'BRPF3', 'FBRS', 'CAP1', 'FAH', 'ETF1', 'ESRP2', 'MT-ATP8', 'EPPK1', 'EPHX1', 'ENTR1', 'ENOX2', 'ENO2', 'ENKD1', 'CACNB2', 'CACNG4', 'ELP3', 'ELOB', 'ELOA', 'CALHM2', 'EIF5', 'EIF4G2', 'EXOC7', 'CACHD1', 'FASTKD5', 'CA9', 'FARP1', 'FAM83A', 'FAM50A', 'FAM189B', 'FAM177A1', 'BTBD7P1', 'C16orf91', 'FAM13B', 'C1orf53', 'C2orf49', 'C6orf62', 'C7orf50', 'FAM126B', 'FAM111B', 'FAM104A', 
'FAM102A', 'C9orf78', 'MNS1', 'AAMP', 'SYT14', 'RAB27A', 'TSHZ2', 'RAB3GAP1', 'SNRNP70', 'RAB35', 'RAB30', 'TBC1D9', 'RAB2B', 'RAB1B', 'TRAF3IP2', 'RAB12', 'RAB11FIP4', 'TSPYL1', 'PYGO2', 'TSR1', 'SNX22', 'PTP4A2', 'RAB5C', 'RABEP1', 'TRIM52-AS1', 'RAD23A', 'RNF122', 'RHOT2', 'RHBDD2', 'RGS10', 'RFK', 'TRAK1', 'RCC1L', 'TRAK2', 'RBSN', 'RBBP6', 'TBKBP1', 'RAPGEF3', 'TBCA', 'TRIM37', 'RAI14', 'SNX24', 'PTGR1', 'PSMG1', 'SPG21', 'PRR12', 'SPN', 'PRMT6', 'PREX1', 'PRDX1', 'TARS1', 'TAOK3', 'PPTC7', 'PPP4R2', 'SPRY1', 'PPP1R12B', 'TAF9B', 'PPM1G', 'PPIL1', 'PPIG', 'TXNRD2', 'TXN', 'TTL', 'PRR34-AS1', 'PSME4', 'SNX27', 'PSMD5', 'PSMD2', 'PSMD14', 'PSIP1', 'SOCS2', 'PRXL2C', 'TUBB6', 'SOS1', 'SPATS2L', 'PRRC2C', 'PRRC2A', 'PRR5L', 'TATDN2', 'RNF146', 'RNF25', 'TCHP', 'SCYL2', 'SETD2', 'SERINC5', 'SERF2', 'SENP6', 'THRB', 'TIAM1', 'SECISBP2L', 'TIMELESS', 'TPX2', 'SAT2', 'SART1', 'SMC6', 'SAMD4A', 'TMEM238', 'TMEM256', 'S100P', 'TCF7L1', 'SETD3', 'SF3B4', 'SH3RF1', 'TEDC2-AS1', 'SLC48A1', 'SMARCB1', 'TGDS', 'TGFB3', 'SLC2A6', 'THAP1', 'SLC25A24', 'SMC5', 'SLC13A5', 'SLAIN2', 'SINHCAFP3', 'SIGMAR1', 'SHOX', 'SHISA5', 'TMEM259', 'SMIM27', 'RTL8C', 'SNHG18', 'RPLP1', 'RPLP0P2', 'TNNT1', 'TOLLIP', 'SNHG9', 'TCF20', 'RPL36', 'SNORD3B-1', 'RPL35', 'TPBG', 'TPM1', 'RPL17', 'RPL12', 'TPM4', 'TCEAL9', 'RPLP2', 'RPS14', 'RSRC2', 'TNIP2', 'RRS1', 'RRP1B', 'RRAS', 'TMEM70', 'TMEM80', 'SMKR1', 'RPS5', 'TMSB4XP4', 'RPS29', 'RPS28', 'RPS21', 'TNFRSF12A', 'RPS19', 'TNFSF13B', 'MT-ND2', 'SRA1', 'SRCAP', 'POLR3GL', 'NDUFB4', 'NFIC', 'NEUROD2', 'NEDD9', 'ZNF263', 'NEDD1', 'ZNF318', 'NDUFC1', 'ZNF326', 'UBE2Q2', 'NDUFA8', 'ZNF33B', 'NCOA5', 'NCOA1', 'NCLN', 'NCKAP1', 'NCK1', 'NINJ1', 'NLK', 'NMD3', 'NME1-NME2', 'NUP93', 'ZBTB7A', 'ZFC3H1', 'NSD1', 'TAF13', 'ZFP36', 'ZHX1', 'NRG4', 'ZMIZ1', 'NQO1', 'NPLOC4', 'NOP10', 'NOM1', 'ZNF202', 'NOL4L', 'NCDN', 'NCBP3', 'NCAM1', 'MT1X', 'MT-TY', 'MT-TV', 'MT-TS2', 'MT-TP', 'MT-TN', 'ZNF702P', 'SYNJ2', 'MT-TM', 'MT-TL1', 'MT-TE', 'ZNF703', 'MT-TD', 
'ZNF764', 'MT-ND5', 'ZNRF1', 'MT1E', 'MTA2', 'NBEAL2', 'ZNF480', 'NAXD', 'NACC1', 'NACA4P', 'NAA10', 'MYO5C', 'ZNF354A', 'MYO10', 'MYH14', 'ZNF418', 'MXRA5', 'MXI1', 'SULF2', 'MTND2P28', 'SYNE2', 'MTND1P23', 'NUPR2', 'OPTN', 'OTUD7B', 'PLCD3', 'PLCB4', 'PLBD2', 'USP35', 'UTP18', 'PLAU', 'UTP3', 'STARD10', 'PKIB', 'PITX1', 'PITPNA', 'PICALM', 'PI4KB', 'PHRF1', 'PHLDA2', 'PHF20L1', 'USP32', 'PLCE1', 'PHACTR1', 'PLD1', 'SREK1IP1P1', 'SRFBP1', 'SRM', 'UIMC1', 'POLE4', 'POLDIP2', 'UPK1B', 'POLB', 'TAF15', 'UQCC2', 'PMEPA1', 'SSX2IP', 'UQCR11', 'UQCRQ', 'PLIN2', 'STRBP', 'VEGFB', 'OVOL1', 'PCYT1A', 'YTHDF1', 'PCDHGA10', 'PCDHB1', 'YWHAB', 'ZBED2', 'ZBED4', 'PATL1', 'PAQR8', 'PAQR7', 'PAPOLA', 'PAK2', 'PACS1', 'ZBTB2', 'P4HA2', 'ZBTB34', 'YKT6', 'PDAP1', 'PFDN4', 'PDCD4', 'VIT', 'PGAM5', 'VMP1', 'PGAM1', 'VPS45', 'VPS9D1-AS1', 'VRK3', 'SLC9A3R1', 'WDR77', 'PDS5A', 'WSB2', 'PDLIM1', 'STRIP1', 'WWC3', 'XBP1'] 979 gene(s) selected 1+ times: ['PGK1', 'MT-CYB', 'MT-CO3', 'MT-CO2', 'TMSB10', 'KRT19', 'LDHA', 'MT-ATP6', 'GAPDH', 'DDIT4', 'BNIP3L', 'HMGA1', 'P4HA1', 'BNIP3', 'NDRG1', 'FGF23', 'DSP', 'S100A11', 'S100A10', 'FAM162A', 'AKR1C2', 'MT-TQ', 'H2AC12', 'KCNJ3', 'FUT11', 'CAV1', 'BAP1', 'TRIM44', 'CACNA1A', 'WDR43', 'ADM', 'NPM1P40', 'LGALS1', 'PFKFB3', 'EGLN3', 'GPI', 'EMP2', 'ZNF302', 'HES1', 'NCALD', 'GPM6A', 'BTBD9', 'ZC3H15', 'ALDOA', 'PRRG3', 'CAMK2N1', 'BCYRN1', 'GATA3', 'RPSAP48', 'GOLGA4', 'PROSER1', 'BMPR1B', 'FOSL2', 'CAMSAP2', 'RPS6KA6', 'FEM1A', 'RGPD4-AS1', 'RHOD', 'RPL13', 'TPI1', 'TPD52L1', 'BTN3A2', 'RPS27', 'ENO1', 'C4orf3', 'TMEM64', 'H19', 'HSPH1', 'H2AC11', 'NEDD4L', 'MT-ND1', 'ZNF688', 'MT-ND4L', 'MT-ND6', 'MT-RNR1', 'MT-RNR2', 'MT-TA', 'AKR1C1', 'AKT1S1', 'ALDOC', 'MT-TS1', 'ATXN2L', 'YTHDF3', 'MAFF', 'APOOL', 'KCNJ2', 'PLOD2', 'HSP90AA1', 'RPL39', 'ARMC6', 'SLC6A8', 'IGFBP3', 'PKM', 'SLC25A48', 'KMT2D', 'SLC2A1', 'RPL37A', 'CNNM2', 'SOX4', 'SET', 'CSTB', 'CLDN4', 'CKS2', 'RPL41', 'EIF3J', 'ERO1A', 'RPL34', 'HSPA5', 'MT-ND3', 'MT-ND4', 'MT2A', 
'MARCKS', 'NCL', 'NEAT1', 'KRT4', 'KPNA2', 'PARD6B', 'KDM3A', 'HSPD1', 'PLEC', 'RPL30', 'HSP90AB1', 'HEPACAM', 'H4C3', 'STMN1', 'GYS1', 'PSMA7', 'FOSL1', 'RPL15', 'RPL23', 'RPL27A', 'RPL28', 'STC2', 'MT-CO1', 'ANGPTL4', 'TMEM258', 'YWHAZ', 'UBA52', 'C19orf53', 'ATP5F1E', 'APEH', 'TOB1', 'TFF3', 'AMOTL2', 'CBX3', 'ZBTB20', 'AHNAK2', 'AURKA', 'TWNK', 'ACTB', 'B4GALT1', 'CALM2', 'TFF1', 'CAST', 'TUBA1B', 'IVL', 'IWS1', 'ARIH1', 'IRF2BPL', 'ITPK1', 'ISOC2', 'ISCU', 'IRAK1', 'INSIG1', 'INPP4B', 'JAKMIP3', 'ARID1B', 'JUN', 'KIF5B', 'KLLN', 'KLHL8', 'KLC2', 'KLC1', 'KITLG', 'KIRREL1', 'ARHGEF26', 'KIF14', 'JUND', 'KHSRP', 'KEAP1', 'ARHGEF7', 'KCTD11', 'KCNQ1OT1', 'ARID5B', 'KAT7', 'INHBA', 'CDKN1A', 'ING2', 'HMGB2', 'HIF3A', 'HEY1', 'HES4', 'ATRX', 'HELQ', 'HCFC1', 'H4C5', 'ATXN1L', 'H3C2', 'H2BC9', 'H2BC4', 'H2AC20', 'H2AC16', 'AXL', 'BAZ2A', 'GSE1', 'GRK2', 'HILPDA', 'HNRNPA2B1', 'INF2', 'HOXC13', 'INCENP', 'ILRUN', 'IGFBP5', 'ARHGAP42', 'ARL13B', 'ARL2', 'IFITM3', 'IFI27L2', 'ARNTL', 'ARNTL2', 'HSPA8', 'ARPP19', 'HSP90B1', 'ARSA', 'ARTN', 'ATF5', 'ATN1', 'ARHGDIA', 'ARF3', 'ARHGAP26', 'MARK3', 'MLLT6', 'MLLT3', 'MKNK1', 'MIXL1', 'MIR663AHG', 'MIR210HG', 'MIOS-DT', 'MIF', 'MGRN1', 'MGLL', 'METTL26', 'MELTF-AS1', 'MED18', 'MDM2', 'MCM4', 'MCM3AP', 'MB', 'MMP1', 'AMFR', 'AKR1C3', 'MSMB', 'ABL1', 'ABO', 'ACAT2', 'ACSL4', 'MSR1', 'ADARB1', 'AFF1', 'AJAP1', 'MMP2', 'AK4', 'AKAP5', 'MRPL55', 'MPHOSPH9', 'MPHOSPH6', 'MPDU1', 'ZRANB1', 'MAZ', 'MARK2', 'KPNA4', 'MAPKAPK2', 'LINC01291', 'LINC01133', 'LINC01116', 'LIMCH1', 'ANKRD9', 'LETM1', 'LDLRAP1', 'ANXA6', 'LDHB', 'LCLAT1', 'LAD1', 'KYNU', 'KRT80', 'KRT8', 'BBOF1', 'ARFGEF1', 'KRT18', 'LINC01304', 'LINC01902', 'LINC02367', 'LXN', 'MAP3K13', 'MAP2K3', 'MALAT1', 'ANKEF1', 'MAD2L1', 'LYAR', 'ANKRD17', 'LTBR', 'LINC02511', 'LRRFIP2', 'LPP', 'ANKRD40', 'LOXL2', 'LMNB2', 'ANKRD52', 'LINC02541', 'GREM1', 'BCL3', 'GPRC5A', 'DNAJA3', 'DNAH11', 'DNAAF5', 'DLD', 'DKK1', 'DKC1', 'DHX38', 'DHX37', 'CAVIN3', 'DHCR7', 'DGKZ', 'DGKD', 
'DERA', 'DDX54', 'CBFA2T3', 'DDX23', 'DDIT3', 'DCTN1', 'DNAJA1', 'DNAJA4', 'DBT', 'DNAJC21', 'CAPZA1', 'CARM1', 'CASP8AP2', 'EGLN1', 'EFNA5', 'EFNA2', 'ECH1', 'EBAG9', 'DYSF', 'DYNC2I2', 'DVL3', 'DUSP9', 'DUSP5', 'DTYMK', 'DTNB', 'DSCAM-AS1', 'DNMT3A', 'DCAKD', 'DBNDD1', 'EIF2B4', 'CNOT6L', 'CMIP', 'CLTB', 'CLSPN', 'CLIP2', 'CLIC1', 'CCNG2', 'CD44', 'CITED2', 'CD47', 'CDC20', 'CIAO2A', 'CHAC2', 'CFAP97', 'CFAP251', 'CERS2', 'CEP83', 'CEP120', 'CCDC34', 'CNOT9', 'DANT1', 'CNR2', 'DAAM1', 'CYP1B1', 'CTXN1', 'CCDC168', 'CSNK2A2', 'CSK', 'CS', 'CRTC2', 'CRNDE', 'CRIP2', 'CREB1', 'CCDC18', 'CPTP', 'CPNE2', 'CPEB1', 'COX8A', 'COL6A3', 'EHBP1L1', 'EIF3A', 'BCAS3', 'GAB2', 'BLOC1S3', 'FTL', 'FSD1L', 'FRY', 'FRS2', 'BMS1', 'BOLA3', 'FOS', 'FLOT2', 'FLNA', 'FGF8', 'BRAT1', 'BRIP1', 'FDFT1', 'BRMS1', 'FBXL18', 'FBXL17', 'FYN', 'GABPB2', 'FBXL16', 'GABRE', 'GPATCH4', 'CENPB', 'BHLHE40', 'GOLGA3', 'GNAQ', 'GLE1', 'GJB3', 'GIN1', 'GDPGP1', 'GDI1', 'GDF15', 'GDAP2', 'GCAT', 'GBP1P1', 'GATAD2A', 'BICDL1', 'BLCAP', 'BRPF3', 'FBRS', 'CAP1', 'FAH', 'ETF1', 'ESRP2', 'MT-ATP8', 'EPPK1', 'EPHX1', 'ENTR1', 'ENOX2', 'ENO2', 'ENKD1', 'CACNB2', 'CACNG4', 'ELP3', 'ELOB', 'ELOA', 'CALHM2', 'EIF5', 'EIF4G2', 'EXOC7', 'CACHD1', 'FASTKD5', 'CA9', 'FARP1', 'FAM83A', 'FAM50A', 'FAM189B', 'FAM177A1', 'BTBD7P1', 'C16orf91', 'FAM13B', 'C1orf53', 'C2orf49', 'C6orf62', 'C7orf50', 'FAM126B', 'FAM111B', 'FAM104A', 'FAM102A', 'C9orf78', 'MNS1', 'AAMP', 'SYT14', 'RAB27A', 'TSHZ2', 'RAB3GAP1', 'SNRNP70', 'RAB35', 'RAB30', 'TBC1D9', 'RAB2B', 'RAB1B', 'TRAF3IP2', 'RAB12', 'RAB11FIP4', 'TSPYL1', 'PYGO2', 'TSR1', 'SNX22', 'PTP4A2', 'RAB5C', 'RABEP1', 'TRIM52-AS1', 'RAD23A', 'RNF122', 'RHOT2', 'RHBDD2', 'RGS10', 'RFK', 'TRAK1', 'RCC1L', 'TRAK2', 'RBSN', 'RBBP6', 'TBKBP1', 'RAPGEF3', 'TBCA', 'TRIM37', 'RAI14', 'SNX24', 'PTGR1', 'PSMG1', 'SPG21', 'PRR12', 'SPN', 'PRMT6', 'PREX1', 'PRDX1', 'TARS1', 'TAOK3', 'PPTC7', 'PPP4R2', 'SPRY1', 'PPP1R12B', 'TAF9B', 'PPM1G', 'PPIL1', 'PPIG', 'TXNRD2', 'TXN', 'TTL', 
'PRR34-AS1', 'PSME4', 'SNX27', 'PSMD5', 'PSMD2', 'PSMD14', 'PSIP1', 'SOCS2', 'PRXL2C', 'TUBB6', 'SOS1', 'SPATS2L', 'PRRC2C', 'PRRC2A', 'PRR5L', 'TATDN2', 'RNF146', 'RNF25', 'TCHP', 'SCYL2', 'SETD2', 'SERINC5', 'SERF2', 'SENP6', 'THRB', 'TIAM1', 'SECISBP2L', 'TIMELESS', 'TPX2', 'SAT2', 'SART1', 'SMC6', 'SAMD4A', 'TMEM238', 'TMEM256', 'S100P', 'TCF7L1', 'SETD3', 'SF3B4', 'SH3RF1', 'TEDC2-AS1', 'SLC48A1', 'SMARCB1', 'TGDS', 'TGFB3', 'SLC2A6', 'THAP1', 'SLC25A24', 'SMC5', 'SLC13A5', 'SLAIN2', 'SINHCAFP3', 'SIGMAR1', 'SHOX', 'SHISA5', 'TMEM259', 'SMIM27', 'RTL8C', 'SNHG18', 'RPLP1', 'RPLP0P2', 'TNNT1', 'TOLLIP', 'SNHG9', 'TCF20', 'RPL36', 'SNORD3B-1', 'RPL35', 'TPBG', 'TPM1', 'RPL17', 'RPL12', 'TPM4', 'TCEAL9', 'RPLP2', 'RPS14', 'RSRC2', 'TNIP2', 'RRS1', 'RRP1B', 'RRAS', 'TMEM70', 'TMEM80', 'SMKR1', 'RPS5', 'TMSB4XP4', 'RPS29', 'RPS28', 'RPS21', 'TNFRSF12A', 'RPS19', 'TNFSF13B', 'MT-ND2', 'SRA1', 'SRCAP', 'POLR3GL', 'NDUFB4', 'NFIC', 'NEUROD2', 'NEDD9', 'ZNF263', 'NEDD1', 'ZNF318', 'NDUFC1', 'ZNF326', 'UBE2Q2', 'NDUFA8', 'ZNF33B', 'NCOA5', 'NCOA1', 'NCLN', 'NCKAP1', 'NCK1', 'NINJ1', 'NLK', 'NMD3', 'NME1-NME2', 'NUP93', 'ZBTB7A', 'ZFC3H1', 'NSD1', 'TAF13', 'ZFP36', 'ZHX1', 'NRG4', 'ZMIZ1', 'NQO1', 'NPLOC4', 'NOP10', 'NOM1', 'ZNF202', 'NOL4L', 'NCDN', 'NCBP3', 'NCAM1', 'MT1X', 'MT-TY', 'MT-TV', 'MT-TS2', 'MT-TP', 'MT-TN', 'ZNF702P', 'SYNJ2', 'MT-TM', 'MT-TL1', 'MT-TE', 'ZNF703', 'MT-TD', 'ZNF764', 'MT-ND5', 'ZNRF1', 'MT1E', 'MTA2', 'NBEAL2', 'ZNF480', 'NAXD', 'NACC1', 'NACA4P', 'NAA10', 'MYO5C', 'ZNF354A', 'MYO10', 'MYH14', 'ZNF418', 'MXRA5', 'MXI1', 'SULF2', 'MTND2P28', 'SYNE2', 'MTND1P23', 'NUPR2', 'OPTN', 'OTUD7B', 'PLCD3', 'PLCB4', 'PLBD2', 'USP35', 'UTP18', 'PLAU', 'UTP3', 'STARD10', 'PKIB', 'PITX1', 'PITPNA', 'PICALM', 'PI4KB', 'PHRF1', 'PHLDA2', 'PHF20L1', 'USP32', 'PLCE1', 'PHACTR1', 'PLD1', 'SREK1IP1P1', 'SRFBP1', 'SRM', 'UIMC1', 'POLE4', 'POLDIP2', 'UPK1B', 'POLB', 'TAF15', 'UQCC2', 'PMEPA1', 'SSX2IP', 'UQCR11', 'UQCRQ', 'PLIN2', 'STRBP', 'VEGFB', 'OVOL1', 
'PCYT1A', 'YTHDF1', 'PCDHGA10', 'PCDHB1', 'YWHAB', 'ZBED2', 'ZBED4', 'PATL1', 'PAQR8', 'PAQR7', 'PAPOLA', 'PAK2', 'PACS1', 'ZBTB2', 'P4HA2', 'ZBTB34', 'YKT6', 'PDAP1', 'PFDN4', 'PDCD4', 'VIT', 'PGAM5', 'VMP1', 'PGAM1', 'VPS45', 'VPS9D1-AS1', 'VRK3', 'SLC9A3R1', 'WDR77', 'PDS5A', 'WSB2', 'PDLIM1', 'STRIP1', 'WWC3', 'XBP1', 'TFRC', 'CDC6', 'ATP9A', 'CD9', 'ATP5MK', 'ATP5MG', 'CDC25B', 'ATP5ME', 'UBC', 'UGDH', 'ATAD2', 'ASB2', 'CDK2AP2', 'TYSND1', 'ARRDC3', 'ARPC1B', 'CEACAM5', 'VCPIP1', 'VEGFA', 'WTAPP1', 'ALDH1A3', 'ZNF473', 'ADAP1', 'ACLY', 'UBB', 'TXNIP', 'CANX', 'TRIM29', 'THBS1', 'CALB1', 'TMEM45A', 'TMSB4X', 'C1orf116', 'C10orf55', 'BUB1B', 'BUB1', 'TPT1', 'TRIM16', 'TSPO', 'TUBG1', 'BIRC5', 'BEND7', 'TST', 'TUBB', 'CCM2', 'TUBB4B', 'CENPF', 'CCNB1', 'CCNB2', 'BAG3', 'TUBD1', 'RPS15A', 'SYTL2', 'ID3', 'HNRNPM', 'HNRNPU', 'HPCAL1', 'HRH1', 'PLK2', 'HSPB1', 'IER2', 'HLA-A', 'IFITM2', 'PLAC8', 'PHC1', 'IRF2BP2', 'IRF6', 'ISG15', 'HMGCS1', 'HK2', 'PFKP', 'H2AX', 'GPX2', 'GSTP1', 'H1-0', 'H1-1', 'H1-3', 'PRNP', 'PRC1', 'POLR2A', 'PPP1R3G', 'HBP1', 'HERPUD1', 'PPIF', 'POLR3A', 'POLR2L', 'ITGA6', 'PFKFB4', 'PRSS8', 'MUL1', 'LMNA', 'NECTIN2', 'NDUFB2', 'LY6D', 'MCM7', 'MYC', 'MIF-AS1', 'NR4A1', 'MTATP6P1', 'MOB3A', 'MOV10', 'MRNIP', 'MRPS2', 'MSMO1', 'NOLC1', 'LBH', 'PERP', 'PCDH1', 'JUNB', 'JUP', 'PDK1', 'PDIA3', 'PCNA', 'KDM5B', 'KIF23', 'LAMB3', 'KIF2C', 'PABPC1', 'KNSTRN', 'NUP188', 'NT5C', 'NRP1', 'PRSS23', 'PSAP', 'CHAC1', 'DHRS3', 'DCBLD2', 'SLCO4A1', 'DDX5', 'SLC3A2', 'SLC39A6', 'SLC38A2', 'SLC20A1', 'CYB561A3', 'DNMT1', 'SENP3', 'SEMA4B', 'SCD', 'EBP', 'EEF1A1', 'CYP1B1-AS1', 'CXCL1', 'S100A6', 'SPP1', 'CKAP2', 'SRXN1', 'CLDN7', 'SRSF8', 'SQSTM1', 'SQLE', 'COX7A2', 'SNRNP25', 'COX7C', 'CPEB4', 'SPAG5', 'CRABP2', 'SNX33', 'SNRPD2', 'EEF2', 'S100A2', 'PSME2', 'FSCN1', 'FDPS', 'FGD5-AS1', 'FGFBP1', 'RALGDS', 'FN1', 'RAC1', 'FTH1', 'ROMO1', 'FYB1', 'QSOX1', 'PYCR3', 'PUSL1', 'PTMS', 'GFRA1', 'FASN', 'RPL10', 'EIF4A2', 'RPS12', 'RPSA', 'RPS8', 'RPS3', 'RPS2', 
'RPS16', 'RPS15', 'EZR', 'RPL11', 'F3', 'RPL8', 'RPL37', 'RPL35A', 'FAM13A', 'RPL21', 'ZWINT']
Top three selected genes:
PGK1: Encodes Phosphoglycerate Kinase 1, a key glycolytic enzyme. Under hypoxic conditions, cells shift from oxidative phosphorylation to anaerobic glycolysis for energy production, and PGK1 plays a central role in this anaerobic pathway, allowing cells to produce ATP even in the absence of oxygen.
MT-CYB: Encodes Cytochrome b, a component of the mitochondrial electron transport chain (ETC), whose efficiency depends on oxygen. Under hypoxia, the ETC is disrupted, leading to mitochondrial stress, altered expression, and possibly reduced oxidative phosphorylation.
MT-CO3: Encodes Cytochrome c Oxidase subunit III, which participates in the final step of the ETC, where oxygen is reduced to water. Its expression is therefore directly affected by the presence or absence of oxygen.
These are just a few examples of the biological roles of the selected genes. Given these functions, it is clear why the models repeatedly selected them.
Top principal components¶
Since PCA is fitted separately to each data set, feature selection on the principal components is performed per data set, aggregating across all the models.
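The aggregation below relies on two helpers defined earlier in the notebook. As a point of reference, they might be sketched as follows; the exact semantics (in particular, that the second positional argument controls sort direction and that `filter_by_occurrences` matches an exact count) are assumptions inferred from how they are used:

```python
from collections import Counter
from typing import Any


def count_and_sort_occurrences(lists: list[list[Any]],
                               descending: bool = True) -> list[tuple[Any, int]]:
    """Count how many of the given lists each item appears in, sorted by count."""
    counts = Counter(item for lst in lists for item in lst)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=descending)


def filter_by_occurrences(occurrences: list[tuple[Any, int]],
                          exactly: int) -> list[Any]:
    """Return the items whose occurrence count equals `exactly`."""
    return [item for item, count in occurrences if count == exactly]
```

With exact-count buckets in hand, the loops below fold them into cumulative "selected by i+ models" lists.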
ss_mcf7_top_pcs_occurrences = count_and_sort_occurrences([
    ss_mcf7_pca_logit_pcs,
    ss_mcf7_pca_svm_pcs,
    ss_mcf7_pca_random_forest_pcs
], False)
ss_mcf7_top_pcs = {}
for i in range(3, 0, -1):
    ss_mcf7_top_pcs[i] = ss_mcf7_top_pcs.get(i + 1, []) + filter_by_occurrences(ss_mcf7_top_pcs_occurrences, i)
for i in range(3, 0, -1):
    if len(ss_mcf7_top_pcs[i]) == 0:
        continue
    print(f"{len(ss_mcf7_top_pcs[i])} PC(s) selected for SmartSeq MCF7 across {i}+ model(s):")
    print(ss_mcf7_top_pcs[i])
    print()
3 PC(s) selected for SmartSeq MCF7 across 3+ model(s): [6, 3, 1] 8 PC(s) selected for SmartSeq MCF7 across 2+ model(s): [6, 3, 1, 18, 17, 16, 12, 8] 13 PC(s) selected for SmartSeq MCF7 across 1+ model(s): [6, 3, 1, 18, 17, 16, 12, 8, 15, 9, 5, 4, 2]
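The loop above folds exact-occurrence buckets into cumulative "i+" lists: walking thresholds from strict to lax, each level inherits everything already selected at the stricter level. A self-contained miniature of that construction (with toy PC labels in place of the notebook's real values):

```python
# Exact-count buckets: PCs selected by exactly 3, 2, or 1 model(s).
exact = {3: ["PC6", "PC3", "PC1"], 2: ["PC18", "PC17"], 1: ["PC15", "PC9"]}

# Walk thresholds from strict to lax so each level inherits the previous one.
cumulative = {}
for i in range(3, 0, -1):
    cumulative[i] = cumulative.get(i + 1, []) + exact.get(i, [])

print(cumulative[1])  # every PC selected by at least one model
```

This is why each "i+" list printed above begins with the stricter list verbatim.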
ss_hcc_top_pcs_occurrences = count_and_sort_occurrences([
    ss_hcc_pca_logit_pcs,
    ss_hcc_pca_svm_pcs,
    ss_hcc_pca_random_forest_pcs
], False)
ss_hcc_top_pcs = {}
for i in range(3, 0, -1):
    ss_hcc_top_pcs[i] = ss_hcc_top_pcs.get(i + 1, []) + filter_by_occurrences(ss_hcc_top_pcs_occurrences, i)
for i in range(3, 0, -1):
    if len(ss_hcc_top_pcs[i]) == 0:
        continue
    print(f"{len(ss_hcc_top_pcs[i])} PC(s) selected for SmartSeq HCC across {i}+ model(s):")
    print(ss_hcc_top_pcs[i])
    print()
2 PC(s) selected for SmartSeq HCC across 3+ models(s): [3, 2] 7 PC(s) selected for SmartSeq HCC across 2+ models(s): [3, 2, 26, 17, 12, 10, 9] 15 PC(s) selected for SmartSeq HCC across 1+ models(s): [3, 2, 26, 17, 12, 10, 9, 32, 30, 23, 21, 16, 15, 13, 4]
ds_mcf7_top_pcs_occurrences = count_and_sort_occurrences([
    ds_mcf7_pca_logit_pcs,
    ds_mcf7_pca_svm_pcs,
    ds_mcf7_pca_random_forest_pcs
], False)
ds_mcf7_top_pcs = {}
for i in range(3, 0, -1):
    ds_mcf7_top_pcs[i] = ds_mcf7_top_pcs.get(i + 1, []) + filter_by_occurrences(ds_mcf7_top_pcs_occurrences, i)
for i in range(3, 0, -1):
    if len(ds_mcf7_top_pcs[i]) == 0:
        continue
    print(f"{len(ds_mcf7_top_pcs[i])} PC(s) selected for DropSeq MCF7 across {i}+ model(s):")
    print(ds_mcf7_top_pcs[i])
    print()
42 PC(s) selected for DropSeq MCF7 across 3+ models(s): [1, 149, 82, 95, 110, 116, 121, 140, 142, 145, 147, 157, 55, 170, 232, 442, 317, 318, 322, 344, 380, 383, 60, 167, 26, 16, 36, 25, 32, 8, 15, 33, 17, 6, 37, 27, 28, 3, 18, 2, 19, 5] 278 PC(s) selected for DropSeq MCF7 across 2+ models(s): [1, 149, 82, 95, 110, 116, 121, 140, 142, 145, 147, 157, 55, 170, 232, 442, 317, 318, 322, 344, 380, 383, 60, 167, 26, 16, 36, 25, 32, 8, 15, 33, 17, 6, 37, 27, 28, 3, 18, 2, 19, 5, 267, 43, 271, 273, 275, 264, 252, 263, 254, 281, 249, 247, 239, 236, 235, 234, 231, 230, 219, 218, 213, 212, 279, 302, 282, 286, 385, 381, 377, 375, 371, 370, 364, 361, 353, 352, 350, 348, 342, 341, 339, 756, 332, 327, 323, 320, 319, 312, 305, 206, 293, 291, 287, 211, 200, 205, 31, 114, 112, 107, 105, 104, 100, 99, 94, 92, 91, 88, 87, 85, 81, 74, 71, 69, 66, 65, 62, 61, 57, 56, 40, 52, 46, 45, 115, 118, 204, 119, 203, 201, 44, 198, 195, 193, 191, 190, 188, 177, 175, 173, 172, 389, 161, 160, 153, 146, 141, 29, 138, 135, 133, 128, 127, 30, 120, 387, 758, 391, 543, 552, 555, 556, 557, 564, 565, 576, 580, 582, 585, 591, 596, 597, 598, 599, 602, 606, 546, 541, 486, 540, 491, 494, 495, 496, 497, 499, 504, 507, 508, 510, 512, 517, 520, 522, 392, 534, 538, 610, 612, 615, 621, 696, 698, 700, 702, 718, 722, 724, 726, 732, 733, 734, 741, 742, 746, 751, 754, 755, 682, 681, 677, 647, 623, 626, 631, 632, 633, 642, 646, 650, 675, 652, 653, 655, 658, 661, 672, 674, 487, 527, 485, 411, 406, 466, 464, 462, 461, 460, 429, 459, 409, 455, 449, 403, 415, 438, 418, 419, 436, 435, 434, 433, 484, 431, 469, 408, 402, 393, 475, 481, 400, 399, 401, 398, 471, 470] 372 PC(s) selected for DropSeq MCF7 across 1+ models(s): [1, 149, 82, 95, 110, 116, 121, 140, 142, 145, 147, 157, 55, 170, 232, 442, 317, 318, 322, 344, 380, 383, 60, 167, 26, 16, 36, 25, 32, 8, 15, 33, 17, 6, 37, 27, 28, 3, 18, 2, 19, 5, 267, 43, 271, 273, 275, 264, 252, 263, 254, 281, 249, 247, 239, 236, 235, 234, 231, 230, 219, 218, 213, 212, 279, 302, 282, 286, 
385, 381, 377, 375, 371, 370, 364, 361, 353, 352, 350, 348, 342, 341, 339, 756, 332, 327, 323, 320, 319, 312, 305, 206, 293, 291, 287, 211, 200, 205, 31, 114, 112, 107, 105, 104, 100, 99, 94, 92, 91, 88, 87, 85, 81, 74, 71, 69, 66, 65, 62, 61, 57, 56, 40, 52, 46, 45, 115, 118, 204, 119, 203, 201, 44, 198, 195, 193, 191, 190, 188, 177, 175, 173, 172, 389, 161, 160, 153, 146, 141, 29, 138, 135, 133, 128, 127, 30, 120, 387, 758, 391, 543, 552, 555, 556, 557, 564, 565, 576, 580, 582, 585, 591, 596, 597, 598, 599, 602, 606, 546, 541, 486, 540, 491, 494, 495, 496, 497, 499, 504, 507, 508, 510, 512, 517, 520, 522, 392, 534, 538, 610, 612, 615, 621, 696, 698, 700, 702, 718, 722, 724, 726, 732, 733, 734, 741, 742, 746, 751, 754, 755, 682, 681, 677, 647, 623, 626, 631, 632, 633, 642, 646, 650, 675, 652, 653, 655, 658, 661, 672, 674, 487, 527, 485, 411, 406, 466, 464, 462, 461, 460, 429, 459, 409, 455, 449, 403, 415, 438, 418, 419, 436, 435, 434, 433, 484, 431, 469, 408, 402, 393, 475, 481, 400, 399, 401, 398, 471, 470, 424, 9, 745, 54, 337, 7, 427, 666, 420, 59, 4, 47, 422, 426, 667, 48, 13, 12, 729, 23, 367, 21, 365, 20, 376, 736, 379, 711, 705, 35, 362, 743, 38, 14, 358, 355, 687, 685, 335, 97, 331, 515, 467, 269, 265, 181, 182, 186, 260, 519, 518, 197, 258, 514, 329, 257, 255, 474, 253, 506, 483, 245, 221, 243, 241, 240, 278, 151, 458, 571, 644, 328, 430, 326, 627, 96, 624, 321, 437, 314, 603, 307, 446, 303, 594, 592, 301, 300, 297, 579, 577, 144, 457, 338]
ds_hcc_top_pcs_occurrences = count_and_sort_occurrences([
    ds_hcc_pca_logit_pcs,
    ds_hcc_pca_svm_pcs,
    ds_hcc_pca_random_forest_pcs,
], False)
ds_hcc_top_pcs = {}
for i in range(3, 0, -1):
    ds_hcc_top_pcs[i] = ds_hcc_top_pcs.get(i + 1, []) + filter_by_occurrences(ds_hcc_top_pcs_occurrences, i)
for i in range(3, 0, -1):
    if len(ds_hcc_top_pcs[i]) == 0:
        continue
    print(f"{len(ds_hcc_top_pcs[i])} PC(s) selected for DropSeq HCC across {i}+ model(s):")
    print(ds_hcc_top_pcs[i])
    print()
69 PC(s) selected for DropSeq HCC across 3+ models(s): [187, 190, 201, 301, 198, 197, 99, 26, 24, 23, 21, 30, 189, 20, 19, 18, 102, 184, 182, 63, 208, 31, 176, 47, 259, 41, 39, 46, 38, 240, 262, 37, 48, 290, 90, 49, 270, 53, 36, 54, 34, 76, 15, 45, 11, 141, 147, 127, 8, 69, 155, 6, 142, 106, 5, 103, 161, 145, 65, 167, 12, 4, 169, 170, 172, 3, 2, 72, 174] 298 PC(s) selected for DropSeq HCC across 2+ models(s): [187, 190, 201, 301, 198, 197, 99, 26, 24, 23, 21, 30, 189, 20, 19, 18, 102, 184, 182, 63, 208, 31, 176, 47, 259, 41, 39, 46, 38, 240, 262, 37, 48, 290, 90, 49, 270, 53, 36, 54, 34, 76, 15, 45, 11, 141, 147, 127, 8, 69, 155, 6, 142, 106, 5, 103, 161, 145, 65, 167, 12, 4, 169, 170, 172, 3, 2, 72, 174, 77, 351, 350, 356, 266, 319, 265, 357, 358, 263, 345, 261, 359, 260, 360, 80, 272, 294, 275, 282, 315, 313, 312, 325, 310, 255, 326, 329, 308, 842, 332, 334, 303, 336, 300, 297, 295, 318, 292, 339, 341, 843, 219, 254, 153, 183, 181, 180, 179, 178, 177, 175, 171, 166, 162, 157, 154, 148, 193, 143, 140, 139, 137, 136, 135, 115, 117, 131, 126, 118, 124, 191, 96, 88, 227, 249, 247, 89, 243, 239, 238, 237, 235, 234, 231, 230, 229, 225, 200, 224, 221, 220, 369, 218, 217, 215, 213, 92, 210, 94, 205, 362, 391, 371, 672, 640, 641, 642, 27, 651, 657, 691, 577, 692, 693, 698, 700, 701, 16, 637, 29, 630, 622, 613, 612, 610, 609, 603, 32, 600, 598, 594, 372, 584, 583, 582, 713, 717, 718, 785, 835, 831, 7, 814, 813, 810, 809, 808, 807, 799, 794, 793, 792, 789, 784, 730, 783, 781, 10, 777, 769, 766, 763, 761, 760, 758, 756, 749, 747, 746, 581, 586, 576, 413, 450, 448, 575, 447, 445, 444, 55, 439, 438, 60, 427, 420, 415, 414, 412, 453, 409, 408, 401, 399, 396, 120, 384, 383, 380, 379, 377, 375, 374, 373, 451, 123, 509, 563, 457, 562, 525, 551, 521, 516, 515, 540, 550, 507, 561, 503, 564, 566, 494, 490, 567, 479, 474, 461, 528] 400 PC(s) selected for DropSeq HCC across 1+ models(s): [187, 190, 201, 301, 198, 197, 99, 26, 24, 23, 21, 30, 189, 20, 19, 18, 102, 184, 182, 63, 208, 31, 
176, 47, 259, 41, 39, 46, 38, 240, 262, 37, 48, 290, 90, 49, 270, 53, 36, 54, 34, 76, 15, 45, 11, 141, 147, 127, 8, 69, 155, 6, 142, 106, 5, 103, 161, 145, 65, 167, 12, 4, 169, 170, 172, 3, 2, 72, 174, 77, 351, 350, 356, 266, 319, 265, 357, 358, 263, 345, 261, 359, 260, 360, 80, 272, 294, 275, 282, 315, 313, 312, 325, 310, 255, 326, 329, 308, 842, 332, 334, 303, 336, 300, 297, 295, 318, 292, 339, 341, 843, 219, 254, 153, 183, 181, 180, 179, 178, 177, 175, 171, 166, 162, 157, 154, 148, 193, 143, 140, 139, 137, 136, 135, 115, 117, 131, 126, 118, 124, 191, 96, 88, 227, 249, 247, 89, 243, 239, 238, 237, 235, 234, 231, 230, 229, 225, 200, 224, 221, 220, 369, 218, 217, 215, 213, 92, 210, 94, 205, 362, 391, 371, 672, 640, 641, 642, 27, 651, 657, 691, 577, 692, 693, 698, 700, 701, 16, 637, 29, 630, 622, 613, 612, 610, 609, 603, 32, 600, 598, 594, 372, 584, 583, 582, 713, 717, 718, 785, 835, 831, 7, 814, 813, 810, 809, 808, 807, 799, 794, 793, 792, 789, 784, 730, 783, 781, 10, 777, 769, 766, 763, 761, 760, 758, 756, 749, 747, 746, 581, 586, 576, 413, 450, 448, 575, 447, 445, 444, 55, 439, 438, 60, 427, 420, 415, 414, 412, 453, 409, 408, 401, 399, 396, 120, 384, 383, 380, 379, 377, 375, 374, 373, 451, 123, 509, 563, 457, 562, 525, 551, 521, 516, 515, 540, 550, 507, 561, 503, 564, 566, 494, 490, 567, 479, 474, 461, 528, 17, 70, 9, 35, 73, 74, 42, 44, 68, 86, 110, 109, 95, 75, 97, 56, 100, 13, 78, 14, 305, 121, 510, 656, 653, 646, 644, 632, 608, 604, 601, 599, 597, 593, 565, 553, 543, 541, 539, 534, 667, 674, 685, 812, 841, 839, 832, 829, 828, 820, 815, 778, 689, 771, 755, 733, 716, 714, 705, 699, 523, 499, 125, 497, 250, 245, 233, 212, 207, 202, 199, 196, 195, 192, 185, 159, 156, 152, 151, 134, 133, 253, 257, 267, 403, 488, 466, 452, 443, 434, 429, 423, 402, 269, 393, 382, 344, 338, 323, 317, 283, 1]
Models trained on selected genes¶
To make the models more general and robust, new ones can be trained on various subsets of the selected genes. The hope is to produce models that train faster while attaining equal or greater accuracy; this may even yield a generalized model that is agnostic to the data set.
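The `train_test_*` wrappers and `TrainedModelWrapper` used below are defined earlier in the notebook. As orientation, one of them might look roughly like this; the hold-out split parameters and solver settings here are assumptions, not the notebook's actual configuration:

```python
from dataclasses import dataclass
from typing import Any

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


@dataclass
class TrainedModelWrapper:
    """A fitted model bundled with its held-out accuracy."""
    model: Any
    accuracy: float


def train_test_logistic_regression(X: pd.DataFrame, y, n_jobs: int = -1,
                                   verbose: bool = False) -> TrainedModelWrapper:
    # Stratified hold-out split (split strategy assumed for this sketch).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    model = LogisticRegression(max_iter=1000, n_jobs=n_jobs)
    model.fit(X_train, y_train)
    return TrainedModelWrapper(model, accuracy_score(y_test, model.predict(X_test)))
```

The analogous SVM and random forest wrappers would swap in the corresponding scikit-learn estimators.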
top_genes_ss_mcf7 = {i: [g for g in top_genes[i] if g in X_ss_mcf7.columns] for i in range(1, 13)}
X_top_genes_ss_mcf7 = {
    i: X_ss_mcf7.loc[:, top_genes_ss_mcf7[i]] for i in range(1, 13)
}
top_genes_ss_hcc = {i: [g for g in top_genes[i] if g in X_ss_hcc.columns] for i in range(1, 13)}
X_top_genes_ss_hcc = {
    i: X_ss_hcc.loc[:, top_genes_ss_hcc[i]] for i in range(1, 13)
}
top_genes_ds_mcf7 = {i: [g for g in top_genes[i] if g in X_ds_mcf7.columns] for i in range(1, 13)}
X_top_genes_ds_mcf7 = {
    i: X_ds_mcf7.loc[:, top_genes_ds_mcf7[i]] for i in range(1, 13)
}
top_genes_ds_hcc = {i: [g for g in top_genes[i] if g in X_ds_hcc.columns] for i in range(1, 13)}
X_top_genes_ds_hcc = {
    i: X_ds_hcc.loc[:, top_genes_ds_hcc[i]] for i in range(1, 13)
}
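The four near-identical blocks above could be collapsed into a single nested comprehension over a dict of matrices. A hypothetical refactor, sketched with toy data standing in for the notebook's real expression matrices:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's filtered expression matrices (genes as columns).
rng = np.random.default_rng(0)
datasets = {
    name: pd.DataFrame(rng.poisson(2, size=(4, 3)),
                       columns=["PGK1", "LDHA", "ACTB"])
    for name in ["ss_mcf7", "ss_hcc", "ds_mcf7", "ds_hcc"]
}
# Toy threshold -> gene-list mapping (the real `top_genes` has thresholds 1..12).
top_genes = {1: ["PGK1", "LDHA", "ACTB", "CA9"], 2: ["PGK1", "LDHA"], 3: ["PGK1"]}

# Keep, per data set and per threshold, only the selected genes actually present.
X_top_genes = {
    name: {i: X.loc[:, [g for g in genes if g in X.columns]]
           for i, genes in top_genes.items()}
    for name, X in datasets.items()
}
```

Genes absent from a platform (here `CA9`) drop out automatically, mirroring the `in X.columns` guard above.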
for i, genes in top_genes_ss_mcf7.items():
    print(f"{len(genes)} genes selected {i}+ times for SmartSeq MCF7")
print()
for i, genes in top_genes_ss_hcc.items():
    print(f"{len(genes)} genes selected {i}+ times for SmartSeq HCC")
print()
for i, genes in top_genes_ds_mcf7.items():
    print(f"{len(genes)} genes selected {i}+ times for DropSeq MCF7")
print()
for i, genes in top_genes_ds_hcc.items():
    print(f"{len(genes)} genes selected {i}+ times for DropSeq HCC")
print()
302 genes selected 1+ times for SmartSeq MCF7 195 genes selected 2+ times for SmartSeq MCF7 76 genes selected 3+ times for SmartSeq MCF7 47 genes selected 4+ times for SmartSeq MCF7 30 genes selected 5+ times for SmartSeq MCF7 17 genes selected 6+ times for SmartSeq MCF7 9 genes selected 7+ times for SmartSeq MCF7 6 genes selected 8+ times for SmartSeq MCF7 3 genes selected 9+ times for SmartSeq MCF7 2 genes selected 10+ times for SmartSeq MCF7 1 genes selected 11+ times for SmartSeq MCF7 1 genes selected 12+ times for SmartSeq MCF7 279 genes selected 1+ times for SmartSeq HCC 160 genes selected 2+ times for SmartSeq HCC 63 genes selected 3+ times for SmartSeq HCC 40 genes selected 4+ times for SmartSeq HCC 26 genes selected 5+ times for SmartSeq HCC 15 genes selected 6+ times for SmartSeq HCC 8 genes selected 7+ times for SmartSeq HCC 7 genes selected 8+ times for SmartSeq HCC 3 genes selected 9+ times for SmartSeq HCC 2 genes selected 10+ times for SmartSeq HCC 1 genes selected 11+ times for SmartSeq HCC 1 genes selected 12+ times for SmartSeq HCC 554 genes selected 1+ times for DropSeq MCF7 507 genes selected 2+ times for DropSeq MCF7 112 genes selected 3+ times for DropSeq MCF7 74 genes selected 4+ times for DropSeq MCF7 31 genes selected 5+ times for DropSeq MCF7 13 genes selected 6+ times for DropSeq MCF7 8 genes selected 7+ times for DropSeq MCF7 6 genes selected 8+ times for DropSeq MCF7 3 genes selected 9+ times for DropSeq MCF7 2 genes selected 10+ times for DropSeq MCF7 1 genes selected 11+ times for DropSeq MCF7 1 genes selected 12+ times for DropSeq MCF7 578 genes selected 1+ times for DropSeq HCC 507 genes selected 2+ times for DropSeq HCC 136 genes selected 3+ times for DropSeq HCC 94 genes selected 4+ times for DropSeq HCC 44 genes selected 5+ times for DropSeq HCC 20 genes selected 6+ times for DropSeq HCC 10 genes selected 7+ times for DropSeq HCC 7 genes selected 8+ times for DropSeq HCC 3 genes selected 9+ times for DropSeq HCC 2 genes selected 
10+ times for DropSeq HCC 1 genes selected 11+ times for DropSeq HCC 1 genes selected 12+ times for DropSeq HCC
Logistic regression¶
ss_mcf7_top_genes_logit: dict[int, TrainedModelWrapper] = {}
ss_hcc_top_genes_logit: dict[int, TrainedModelWrapper] = {}
ds_mcf7_top_genes_logit: dict[int, TrainedModelWrapper] = {}
ds_hcc_top_genes_logit: dict[int, TrainedModelWrapper] = {}
SmartSeq MCF7
for i in [12, 6, 1]:
    ss_mcf7_top_genes_logit[i] = train_test_logistic_regression(X_top_genes_ss_mcf7[i], y_ss_mcf7, n_jobs=-1, verbose=False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_mcf7_top_genes_logit[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9523809523809523 Genes selected across 6+ model(s). Accuracy: 1.0 Genes selected across 1+ model(s). Accuracy: 1.0
SmartSeq HCC
for i in [12, 6, 1]:
    ss_hcc_top_genes_logit[i] = train_test_logistic_regression(X_top_genes_ss_hcc[i], y_ss_hcc, n_jobs=-1, verbose=False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_hcc_top_genes_logit[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9782608695652174 Genes selected across 6+ model(s). Accuracy: 0.9782608695652174 Genes selected across 1+ model(s). Accuracy: 0.9782608695652174
DropSeq MCF7
for i in [12, 6, 1]:
    ds_mcf7_top_genes_logit[i] = train_test_logistic_regression(X_top_genes_ds_mcf7[i], y_ds_mcf7, n_jobs=-1, verbose=False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_mcf7_top_genes_logit[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.726465692620677 Genes selected across 6+ model(s). Accuracy: 0.8823746994636582 Genes selected across 1+ model(s). Accuracy: 0.9774366561864251
DropSeq HCC
for i in [12, 6, 1]:
    ds_hcc_top_genes_logit[i] = train_test_logistic_regression(X_top_genes_ds_hcc[i], y_ds_hcc, n_jobs=-1, verbose=False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_hcc_top_genes_logit[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.6061018795968401 Genes selected across 6+ model(s). Accuracy: 0.8136747480250613 Genes selected across 1+ model(s). Accuracy: 0.9476981748842277
SVM¶
ss_mcf7_top_genes_svm: dict[int, TrainedModelWrapper] = {}
ss_hcc_top_genes_svm: dict[int, TrainedModelWrapper] = {}
ds_mcf7_top_genes_svm: dict[int, TrainedModelWrapper] = {}
ds_hcc_top_genes_svm: dict[int, TrainedModelWrapper] = {}
SmartSeq MCF7
for i in [12, 6, 1]:
    ss_mcf7_top_genes_svm[i] = train_test_svm(X_top_genes_ss_mcf7[i], y_ss_mcf7, n_jobs=-1, verbose=False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_mcf7_top_genes_svm[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9365079365079365 Genes selected across 6+ model(s). Accuracy: 1.0 Genes selected across 1+ model(s). Accuracy: 1.0
SmartSeq HCC
for i in [12, 6, 1]:
    ss_hcc_top_genes_svm[i] = train_test_svm(X_top_genes_ss_hcc[i], y_ss_hcc, n_jobs=-1, verbose=False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ss_hcc_top_genes_svm[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9782608695652174 Genes selected across 6+ model(s). Accuracy: 0.9565217391304348 Genes selected across 1+ model(s). Accuracy: 0.9782608695652174
DropSeq MCF7
for i in [12, 6, 1]:
    ds_mcf7_top_genes_svm[i] = train_test_svm(X_top_genes_ds_mcf7[i], y_ds_mcf7, n_jobs=-1, verbose=False)
    print(f"Genes selected across {i}+ model(s). Accuracy: {ds_mcf7_top_genes_svm[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.726465692620677
Genes selected across 6+ model(s). Accuracy: 0.8853338265211762
Genes selected across 1+ model(s). Accuracy: 0.9772517107453301
DropSeq HCC
for i in [12, 6, 1]:
ds_hcc_top_genes_svm[i] = train_test_svm(X_top_genes_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
print(f"Genes selected across {i}+ model(s). Accuracy: {ds_hcc_top_genes_svm[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.6061018795968401
Genes selected across 6+ model(s). Accuracy: 0.8136747480250613
Genes selected across 1+ model(s). Accuracy: 0.9506946336148189
Random forest¶
ss_mcf7_top_genes_random_forest: dict[int, TrainedModelWrapper] = {}
ss_hcc_top_genes_random_forest: dict[int, TrainedModelWrapper] = {}
ds_mcf7_top_genes_random_forest: dict[int, TrainedModelWrapper] = {}
ds_hcc_top_genes_random_forest: dict[int, TrainedModelWrapper] = {}
SmartSeq MCF7
for i in [12, 6, 1]:
ss_mcf7_top_genes_random_forest[i] = train_test_random_forest(X_top_genes_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
print(f"Genes selected across {i}+ model(s). Accuracy: {ss_mcf7_top_genes_random_forest[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9365079365079365
Genes selected across 6+ model(s). Accuracy: 1.0
Genes selected across 1+ model(s). Accuracy: 1.0
SmartSeq HCC
for i in [12, 6, 1]:
ss_hcc_top_genes_random_forest[i] = train_test_random_forest(X_top_genes_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
print(f"Genes selected across {i}+ model(s). Accuracy: {ss_hcc_top_genes_random_forest[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.9782608695652174
Genes selected across 6+ model(s). Accuracy: 0.9782608695652174
Genes selected across 1+ model(s). Accuracy: 0.9782608695652174
DropSeq MCF7
for i in [12, 6, 1]:
ds_mcf7_top_genes_random_forest[i] = train_test_random_forest(X_top_genes_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
print(f"Genes selected across {i}+ model(s). Accuracy: {ds_mcf7_top_genes_random_forest[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.726465692620677
Genes selected across 6+ model(s). Accuracy: 0.884224153874607
Genes selected across 1+ model(s). Accuracy: 0.9726280747179582
DropSeq HCC
for i in [12, 6, 1]:
ds_hcc_top_genes_random_forest[i] = train_test_random_forest(X_top_genes_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
print(f"Genes selected across {i}+ model(s). Accuracy: {ds_hcc_top_genes_random_forest[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.48597112503405065
Genes selected across 6+ model(s). Accuracy: 0.8193952601470988
Genes selected across 1+ model(s). Accuracy: 0.9406156360664669
Multilayer perceptron¶
ss_mcf7_top_genes_mlp: dict[int, TrainedModelWrapper] = {}
ss_hcc_top_genes_mlp: dict[int, TrainedModelWrapper] = {}
ds_mcf7_top_genes_mlp: dict[int, TrainedModelWrapper] = {}
ds_hcc_top_genes_mlp: dict[int, TrainedModelWrapper] = {}
SmartSeq MCF7
for i in [12, 6, 1]:
ss_mcf7_top_genes_mlp[i] = train_test_mlp(X_top_genes_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
print(f"Genes selected across {i}+ model(s). Accuracy: {ss_mcf7_top_genes_mlp[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.5079365079365079
Genes selected across 6+ model(s). Accuracy: 0.9206349206349206
Genes selected across 1+ model(s). Accuracy: 0.9682539682539683
SmartSeq HCC
for i in [12, 6, 1]:
ss_hcc_top_genes_mlp[i] = train_test_mlp(X_top_genes_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
print(f"Genes selected across {i}+ model(s). Accuracy: {ss_hcc_top_genes_mlp[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.5434782608695652
Genes selected across 6+ model(s). Accuracy: 0.9130434782608695
Genes selected across 1+ model(s). Accuracy: 0.9347826086956522
DropSeq MCF7
for i in [12, 6, 1]:
ds_mcf7_top_genes_mlp[i] = train_test_mlp(X_top_genes_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
print(f"Genes selected across {i}+ model(s). Accuracy: {ds_mcf7_top_genes_mlp[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.726465692620677
Genes selected across 6+ model(s). Accuracy: 0.8875531718143148
Genes selected across 1+ model(s). Accuracy: 0.980395783243943
DropSeq HCC
for i in [12, 6, 1]:
ds_hcc_top_genes_mlp[i] = train_test_mlp(X_top_genes_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
print(f"Genes selected across {i}+ model(s). Accuracy: {ds_hcc_top_genes_mlp[i].accuracy}")
Genes selected across 12+ model(s). Accuracy: 0.6061018795968401
Genes selected across 6+ model(s). Accuracy: 0.8163988014165078
Genes selected across 1+ model(s). Accuracy: 0.9504222282756742
Models trained on selected PCs¶
X_top_pcs_ss_mcf7 = {i: X_pca_ss_mcf7[:, ss_mcf7_top_pcs[i]] for i in range(1, 4)}
X_top_pcs_ss_hcc = {i: X_pca_ss_hcc[:, ss_hcc_top_pcs[i]] for i in range(1, 4)}
X_top_pcs_ds_mcf7 = {i: X_pca_ds_mcf7[:, ds_mcf7_top_pcs[i]] for i in range(1, 4)}
X_top_pcs_ds_hcc = {i: X_pca_ds_hcc[:, ds_hcc_top_pcs[i]] for i in range(1, 4)}
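The dictionaries above rely on NumPy integer-array indexing to pull out the selected PC columns. A self-contained sketch of the same pattern, with a hypothetical score matrix and index lists standing in for `X_pca_*` and `*_top_pcs`:

```python
import numpy as np

# Hypothetical PCA score matrix: 5 cells x 4 principal components.
X_pca = np.arange(20, dtype = float).reshape(5, 4)

# Hypothetical "PCs selected across i+ models" index lists.
top_pcs = {1: [0, 1, 3], 2: [0, 3], 3: [0]}

# Integer-array indexing on axis 1 keeps only the chosen columns,
# in the order they appear in the index list.
X_top_pcs = {i: X_pca[:, top_pcs[i]] for i in range(1, 4)}

print(X_top_pcs[2].shape)  # (5, 2)
```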
Logistic regression¶
ss_mcf7_top_pcs_logit: dict[int, TrainedModelWrapper] = {}
ss_hcc_top_pcs_logit: dict[int, TrainedModelWrapper] = {}
ds_mcf7_top_pcs_logit: dict[int, TrainedModelWrapper] = {}
ds_hcc_top_pcs_logit: dict[int, TrainedModelWrapper] = {}
SmartSeq MCF7
for i in range(3, 0, -1):
ss_mcf7_top_pcs_logit[i] = train_test_logistic_regression(X_top_pcs_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ss_mcf7_top_pcs_logit[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.5079365079365079
PCs selected across 2+ model(s). Accuracy: 0.5238095238095238
PCs selected across 1+ model(s). Accuracy: 0.7619047619047619
SmartSeq HCC
for i in range(3, 0, -1):
ss_hcc_top_pcs_logit[i] = train_test_logistic_regression(X_top_pcs_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ss_hcc_top_pcs_logit[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9130434782608695
PCs selected across 2+ model(s). Accuracy: 0.9347826086956522
PCs selected across 1+ model(s). Accuracy: 0.8913043478260869
DropSeq MCF7
for i in range(3, 0, -1):
ds_mcf7_top_pcs_logit[i] = train_test_logistic_regression(X_top_pcs_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ds_mcf7_top_pcs_logit[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9258368781209543
PCs selected across 2+ model(s). Accuracy: 0.9315701867948956
PCs selected across 1+ model(s). Accuracy: 0.9408174588496394
DropSeq HCC
for i in range(3, 0, -1):
ds_hcc_top_pcs_logit[i] = train_test_logistic_regression(X_top_pcs_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ds_hcc_top_pcs_logit[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.8673385998365568
PCs selected across 2+ model(s). Accuracy: 0.8921274856987197
PCs selected across 1+ model(s). Accuracy: 0.9199128302914737
SVM¶
ss_mcf7_top_pcs_svm: dict[int, TrainedModelWrapper] = {}
ss_hcc_top_pcs_svm: dict[int, TrainedModelWrapper] = {}
ds_mcf7_top_pcs_svm: dict[int, TrainedModelWrapper] = {}
ds_hcc_top_pcs_svm: dict[int, TrainedModelWrapper] = {}
SmartSeq MCF7
for i in range(3, 0, -1):
ss_mcf7_top_pcs_svm[i] = train_test_svm(X_top_pcs_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ss_mcf7_top_pcs_svm[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.49206349206349204
PCs selected across 2+ model(s). Accuracy: 0.47619047619047616
PCs selected across 1+ model(s). Accuracy: 0.746031746031746
SmartSeq HCC
for i in range(3, 0, -1):
ss_hcc_top_pcs_svm[i] = train_test_svm(X_top_pcs_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ss_hcc_top_pcs_svm[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9347826086956522
PCs selected across 2+ model(s). Accuracy: 0.9347826086956522
PCs selected across 1+ model(s). Accuracy: 0.8695652173913043
DropSeq MCF7
for i in range(3, 0, -1):
ds_mcf7_top_pcs_svm[i] = train_test_svm(X_top_pcs_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ds_mcf7_top_pcs_svm[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9238024782689107
PCs selected across 2+ model(s). Accuracy: 0.926391714444239
PCs selected across 1+ model(s). Accuracy: 0.9371185500277418
DropSeq HCC
for i in range(3, 0, -1):
ds_hcc_top_pcs_svm[i] = train_test_svm(X_top_pcs_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ds_hcc_top_pcs_svm[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.8662489784799782
PCs selected across 2+ model(s). Accuracy: 0.8902206483247072
PCs selected across 1+ model(s). Accuracy: 0.9212748569871969
Random forest¶
ss_mcf7_top_pcs_random_forest: dict[int, TrainedModelWrapper] = {}
ss_hcc_top_pcs_random_forest: dict[int, TrainedModelWrapper] = {}
ds_mcf7_top_pcs_random_forest: dict[int, TrainedModelWrapper] = {}
ds_hcc_top_pcs_random_forest: dict[int, TrainedModelWrapper] = {}
SmartSeq MCF7
for i in range(3, 0, -1):
ss_mcf7_top_pcs_random_forest[i] = train_test_random_forest(X_top_pcs_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ss_mcf7_top_pcs_random_forest[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9206349206349206
PCs selected across 2+ model(s). Accuracy: 0.9365079365079365
PCs selected across 1+ model(s). Accuracy: 0.9841269841269841
SmartSeq HCC
for i in range(3, 0, -1):
ss_hcc_top_pcs_random_forest[i] = train_test_random_forest(X_top_pcs_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ss_hcc_top_pcs_random_forest[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9347826086956522
PCs selected across 2+ model(s). Accuracy: 0.9347826086956522
PCs selected across 1+ model(s). Accuracy: 0.9347826086956522
DropSeq MCF7
for i in range(3, 0, -1):
ds_mcf7_top_pcs_random_forest[i] = train_test_random_forest(X_top_pcs_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ds_mcf7_top_pcs_random_forest[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9326798594414648
PCs selected across 2+ model(s). Accuracy: 0.9343443684113186
PCs selected across 1+ model(s). Accuracy: 0.9361938228222675
DropSeq HCC
for i in range(3, 0, -1):
ds_hcc_top_pcs_random_forest[i] = train_test_random_forest(X_top_pcs_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ds_hcc_top_pcs_random_forest[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.8703350585671479
PCs selected across 2+ model(s). Accuracy: 0.8714246799237265
PCs selected across 1+ model(s). Accuracy: 0.8855897575592482
Multilayer perceptron¶
ss_mcf7_top_pcs_mlp: dict[int, TrainedModelWrapper] = {}
ss_hcc_top_pcs_mlp: dict[int, TrainedModelWrapper] = {}
ds_mcf7_top_pcs_mlp: dict[int, TrainedModelWrapper] = {}
ds_hcc_top_pcs_mlp: dict[int, TrainedModelWrapper] = {}
SmartSeq MCF7
for i in range(3, 0, -1):
ss_mcf7_top_pcs_mlp[i] = train_test_mlp(X_top_pcs_ss_mcf7[i], y_ss_mcf7, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ss_mcf7_top_pcs_mlp[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.7619047619047619
PCs selected across 2+ model(s). Accuracy: 0.8571428571428571
PCs selected across 1+ model(s). Accuracy: 0.8095238095238095
SmartSeq HCC
for i in range(3, 0, -1):
ss_hcc_top_pcs_mlp[i] = train_test_mlp(X_top_pcs_ss_hcc[i], y_ss_hcc, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ss_hcc_top_pcs_mlp[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9347826086956522
PCs selected across 2+ model(s). Accuracy: 0.9130434782608695
PCs selected across 1+ model(s). Accuracy: 0.9565217391304348
DropSeq MCF7
for i in range(3, 0, -1):
ds_mcf7_top_pcs_mlp[i] = train_test_mlp(X_top_pcs_ds_mcf7[i], y_ds_mcf7, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ds_mcf7_top_pcs_mlp[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.9472905492879601
PCs selected across 2+ model(s). Accuracy: 0.953393748844091
PCs selected across 1+ model(s). Accuracy: 0.9628259663399297
DropSeq HCC
for i in range(3, 0, -1):
ds_hcc_top_pcs_mlp[i] = train_test_mlp(X_top_pcs_ds_hcc[i], y_ds_hcc, n_jobs = -1, verbose = False)
print(f"PCs selected across {i}+ model(s). Accuracy: {ds_hcc_top_pcs_mlp[i].accuracy}")
PCs selected across 3+ model(s). Accuracy: 0.8943067284118769
PCs selected across 2+ model(s). Accuracy: 0.9114682647779897
PCs selected across 1+ model(s). Accuracy: 0.9229092890220648
Individual model comparison¶
def compare_accuracies(logits: dict[int, TrainedModelWrapper], svms: dict[int, TrainedModelWrapper], random_forests: dict[int, TrainedModelWrapper], mlps: dict[int, TrainedModelWrapper], feature_type: str = "feature"):
    named_models = [("Logistic regression", logits), ("SVM", svms), ("Random forest", random_forests), ("Multilayer perceptron", mlps)]
    for name, models in named_models:
        # Report the threshold whose trained model achieved the best test accuracy.
        max_key = max(models, key = lambda k: models[k].accuracy)
        print(f"{name} accuracy: {models[max_key].accuracy} with {feature_type}s selected across {max_key}+ models.")
Models trained on selected genes¶
SmartSeq MCF7
compare_accuracies(ss_mcf7_top_genes_logit, ss_mcf7_top_genes_svm, ss_mcf7_top_genes_random_forest, ss_mcf7_top_genes_mlp, "gene")
Logistic regression accuracy: 1.0 with genes selected across 6+ models.
SVM accuracy: 1.0 with genes selected across 6+ models.
Random forest accuracy: 1.0 with genes selected across 6+ models.
Multilayer perceptron accuracy: 0.9841269841269841 with genes selected across 1+ models.
SmartSeq HCC
compare_accuracies(ss_hcc_top_genes_logit, ss_hcc_top_genes_svm, ss_hcc_top_genes_random_forest, ss_hcc_top_genes_mlp, "gene")
Logistic regression accuracy: 0.9782608695652174 with genes selected across 12+ models.
SVM accuracy: 0.9782608695652174 with genes selected across 12+ models.
Random forest accuracy: 0.9782608695652174 with genes selected across 6+ models.
Multilayer perceptron accuracy: 0.8695652173913043 with genes selected across 6+ models.
DropSeq MCF7
compare_accuracies(ds_mcf7_top_genes_logit, ds_mcf7_top_genes_svm, ds_mcf7_top_genes_random_forest, ds_mcf7_top_genes_mlp, "gene")
Logistic regression accuracy: 0.9757721472165711 with genes selected across 1+ models.
SVM accuracy: 0.9750323654521916 with genes selected across 1+ models.
Random forest accuracy: 0.9726280747179582 with genes selected across 1+ models.
Multilayer perceptron accuracy: 0.9783613833918994 with genes selected across 1+ models.
DropSeq HCC
compare_accuracies(ds_hcc_top_genes_logit, ds_hcc_top_genes_svm, ds_hcc_top_genes_random_forest, ds_hcc_top_genes_mlp, "gene")
Logistic regression accuracy: 0.9512394442931081 with genes selected across 1+ models.
SVM accuracy: 0.9515118496322528 with genes selected across 1+ models.
Random forest accuracy: 0.9441569054753474 with genes selected across 1+ models.
Multilayer perceptron accuracy: 0.9504222282756742 with genes selected across 1+ models.
As expected, the multilayer perceptron underperforms on the Smart-seq data, where the data sets are too small for it to train well. The other three models perform very well when trained on selected genes for Smart-seq, with no meaningful differences in accuracy.
On Drop-seq, all four models perform well, with the multilayer perceptron achieving the highest accuracy on MCF7.
For the larger data sets, the best accuracies came from training on the genes selected by at least one model, which makes sense: this threshold yields the largest pool of features, all of which already carry predictive power.
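The "selected across i+ models" gene pools used throughout this section can be built by counting how often each gene appears in the per-model selections and thresholding the counts. A minimal sketch with hypothetical gene lists (the notebook's actual selection code lives upstream of this section):

```python
from collections import Counter

# Hypothetical per-model selected-gene lists (the real pipeline has 12 models).
selections = [
    ["VEGFA", "CA9"],
    ["VEGFA", "SLC2A1"],
    ["VEGFA", "CA9", "NDRG1"],
]

# Count how many models selected each gene.
counts = Counter(gene for genes in selections for gene in genes)

# Genes selected by at least `i` models, for the thresholds of interest.
top_genes = {i: sorted(g for g, c in counts.items() if c >= i) for i in [1, 2, 3]}

print(top_genes[2])  # ['CA9', 'VEGFA']
```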
Models trained on PCA-encoded data¶
SmartSeq MCF7
print("Logistic regression accuracy:", ss_mcf7_pca_logit.accuracy)
print("SVM accuracy:", ss_mcf7_pca_svm.accuracy)
print("Random forest accuracy:", ss_mcf7_pca_random_forest.accuracy)
print("Multilayer perceptron accuracy:", ss_mcf7_pca_mlp.accuracy)
Logistic regression accuracy: 1.0
SVM accuracy: 1.0
Random forest accuracy: 1.0
Multilayer perceptron accuracy: 0.9682539682539683
SmartSeq HCC
print("Logistic regression accuracy:", ss_hcc_pca_logit.accuracy)
print("SVM accuracy:", ss_hcc_pca_svm.accuracy)
print("Random forest accuracy:", ss_hcc_pca_random_forest.accuracy)
print("Multilayer perceptron accuracy:", ss_hcc_pca_mlp.accuracy)
Logistic regression accuracy: 0.9782608695652174
SVM accuracy: 0.9782608695652174
Random forest accuracy: 0.9782608695652174
Multilayer perceptron accuracy: 0.8913043478260869
DropSeq MCF7
print("Logistic regression accuracy:", ds_mcf7_pca_logit.accuracy)
print("SVM accuracy:", ds_mcf7_pca_svm.accuracy)
print("Random forest accuracy:", ds_mcf7_pca_random_forest.accuracy)
print("Multilayer perceptron accuracy:", ds_mcf7_pca_mlp.accuracy)
Logistic regression accuracy: 0.975957092657666
SVM accuracy: 0.9755872017754762
Random forest accuracy: 0.926206769003144
Multilayer perceptron accuracy: 0.9776216016275199
DropSeq HCC
print("Logistic regression accuracy:", ds_hcc_pca_logit.accuracy)
print("SVM accuracy:", ds_hcc_pca_svm.accuracy)
print("Random forest accuracy:", ds_hcc_pca_random_forest.accuracy)
print("Multilayer perceptron accuracy:", ds_hcc_pca_mlp.accuracy)
Logistic regression accuracy: 0.9566875510760011
SVM accuracy: 0.954508308362844
Random forest accuracy: 0.8948515390901661
Multilayer perceptron accuracy: 0.9591391991283029
All data sets
print("Logistic regression accuracy:", np.mean([ss_mcf7_pca_logit.accuracy, ss_hcc_pca_logit.accuracy, ds_mcf7_pca_logit.accuracy, ds_hcc_pca_logit.accuracy]))
print("SVM accuracy:", np.mean([ss_mcf7_pca_svm.accuracy, ss_hcc_pca_svm.accuracy, ds_mcf7_pca_svm.accuracy, ds_hcc_pca_svm.accuracy]))
print("Random forest accuracy:", np.mean([ss_mcf7_pca_random_forest.accuracy, ss_hcc_pca_random_forest.accuracy, ds_mcf7_pca_random_forest.accuracy, ds_hcc_pca_random_forest.accuracy]))
print("Multilayer perceptron accuracy:", np.mean([ss_mcf7_pca_mlp.accuracy, ss_hcc_pca_mlp.accuracy, ds_mcf7_pca_mlp.accuracy, ds_hcc_pca_mlp.accuracy]))
Logistic regression accuracy: 0.9777263783247212
SVM accuracy: 0.9770890949258844
Random forest accuracy: 0.9498297944146319
Multilayer perceptron accuracy: 0.9490797792089696
Averaged over all data sets, logistic regression and SVM perform best on the PCA-encoded data, whereas the multilayer perceptron leads on both Drop-seq data sets.
Models trained on PCA-encoded data, with feature selection¶
SmartSeq MCF7
compare_accuracies(ss_mcf7_top_pcs_logit, ss_mcf7_top_pcs_svm, ss_mcf7_top_pcs_random_forest, ss_mcf7_top_pcs_mlp, "PC")
Logistic regression accuracy: 0.7936507936507936 with PCs selected across 2+ models.
SVM accuracy: 0.7619047619047619 with PCs selected across 2+ models.
Random forest accuracy: 0.9523809523809523 with PCs selected across 2+ models.
Multilayer perceptron accuracy: 0.8888888888888888 with PCs selected across 2+ models.
SmartSeq HCC
compare_accuracies(ss_hcc_top_pcs_logit, ss_hcc_top_pcs_svm, ss_hcc_top_pcs_random_forest, ss_hcc_top_pcs_mlp, "PC")
Logistic regression accuracy: 0.9347826086956522 with PCs selected across 3+ models.
SVM accuracy: 0.9130434782608695 with PCs selected across 3+ models.
Random forest accuracy: 0.9130434782608695 with PCs selected across 2+ models.
Multilayer perceptron accuracy: 0.9130434782608695 with PCs selected across 3+ models.
DropSeq MCF7
compare_accuracies(ds_mcf7_top_pcs_logit, ds_mcf7_top_pcs_svm, ds_mcf7_top_pcs_random_forest, ds_mcf7_top_pcs_mlp, "PC")
Logistic regression accuracy: 0.9347142592935084 with PCs selected across 1+ models.
SVM accuracy: 0.9295357869428519 with PCs selected across 1+ models.
Random forest accuracy: 0.9374884409099316 with PCs selected across 2+ models.
Multilayer perceptron accuracy: 0.9515442944331423 with PCs selected across 1+ models.
DropSeq HCC
compare_accuracies(ds_hcc_top_pcs_logit, ds_hcc_top_pcs_svm, ds_hcc_top_pcs_random_forest, ds_hcc_top_pcs_mlp, "PC")
Logistic regression accuracy: 0.925905747752656 with PCs selected across 1+ models.
SVM accuracy: 0.925088531735222 with PCs selected across 1+ models.
Random forest accuracy: 0.892672296377009 with PCs selected across 1+ models.
Multilayer perceptron accuracy: 0.9299918278398257 with PCs selected across 1+ models.
While some of these accuracies are respectable, the models trained on selected principal components do not match their counterparts trained on the full PCA encoding or on selected genes. This is unsurprising: each principal component is a linear combination of all the original genes, so the components blur the meaning of individual features, and selecting among them discards variance that may still be discriminative.
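The point that PCs mix all original features can be checked directly: in a fitted PCA, every component carries a loading for every input feature. A small sketch with a hypothetical random expression matrix (using scikit-learn, which the `train_test_*` helpers are assumed to build on):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size = (100, 6))  # hypothetical expression matrix: 100 cells x 6 genes

pca = PCA(n_components = 3).fit(X)

# components_ has one loading per (PC, gene) pair, so dropping a PC removes
# a slice of variance drawn from all genes, not any specific gene.
print(pca.components_.shape)  # (3, 6)
```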
Ensemble models¶
In search of a generalized model with higher accuracy, we examine several ways of hybridizing the individual models.
Simple majority vote¶
The simplest ensemble takes the predictions from the provided models and returns the majority vote.
class SimpleMajorityVoteClassifier():
def __init__(self, models: list[TrainedModelWrapper]):
"""
models: list of pretrained classifiers with .predict() method
"""
self.models = models
test_sets = [(model.X_test, model.y_test) for model in self.models]
self.assert_test_sets_equal(test_sets)
self.X_test = models[0].X_test
self.y_test = models[0].y_test
self.features = self.X_test.columns
self.accuracy = None
def predict(self, X):
if missing := set(self.features) - set(X.columns):
raise ValueError(f"Missing columns: {missing}")
predictions = [model.predict(X) for model in self.models]
predictions = np.vstack(predictions)
predictions_T = predictions.T
majority_votes = []
for sample_preds in predictions_T:
values, counts = np.unique(sample_preds, return_counts = True)
best_label = values[np.argmax(counts)]
majority_votes.append(best_label)
return np.array(majority_votes)
def assert_test_sets_equal(self, test_sets: list):
ref_X, ref_y = test_sets[0]
for i, (X, y) in enumerate(test_sets[1:], start=1):
if not (np.array_equal(ref_X, X) and np.array_equal(ref_y, y)):
raise ValueError(f"Test set {i} does not match the reference test set.")
def test(self, X_test: Any | None = None, y_test: Any | None = None, verbose: bool = True):
if X_test is None or y_test is None:  # compare with None: the truth value of a DataFrame is ambiguous
X_test = self.X_test
y_test = self.y_test
self.accuracy = test_model(self, X_test, y_test, verbose)
return self.accuracy
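The per-sample vote inside `predict` reduces to `np.unique` with counts. A toy run with hypothetical predictions from three models over four samples:

```python
import numpy as np

# Rows: one prediction array per model; columns: samples.
predictions = np.array([
    ["Hypo", "Hypo", "Norm", "Norm"],
    ["Hypo", "Norm", "Norm", "Hypo"],
    ["Norm", "Hypo", "Norm", "Hypo"],
])

majority = []
for sample_preds in predictions.T:
    # np.unique returns sorted labels; argmax breaks ties toward the first label.
    values, counts = np.unique(sample_preds, return_counts = True)
    majority.append(str(values[np.argmax(counts)]))

print(majority)  # ['Hypo', 'Hypo', 'Norm', 'Hypo']
```

With an odd number of models and binary labels, as in the Smart-seq ensembles below, ties cannot occur.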
Since the multilayer perceptron does not work well with small data sets, it is omitted from the SmartSeq ensembles.
ss_mcf7_simple_ensemble = SimpleMajorityVoteClassifier([
ss_mcf7_top_genes_logit[1],
ss_mcf7_top_genes_svm[1],
ss_mcf7_top_genes_random_forest[1]
])
ss_mcf7_simple_ensemble.test()
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 31 0
Actual Norm 0 32
Accuracy: 1.0
Classification report:
precision recall f1-score support
Hypo 1.00 1.00 1.00 31
Norm 1.00 1.00 1.00 32
accuracy 1.00 63
macro avg 1.00 1.00 1.00 63
weighted avg 1.00 1.00 1.00 63
1.0
ss_hcc_simple_ensemble = SimpleMajorityVoteClassifier([
ss_hcc_top_genes_logit[1],
ss_hcc_top_genes_svm[1],
ss_hcc_top_genes_random_forest[1]
])
ss_hcc_simple_ensemble.test()
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 24 1
Actual Norm 0 21
Accuracy: 0.9782608695652174
Classification report:
precision recall f1-score support
Hypo 1.00 0.96 0.98 25
Norm 0.95 1.00 0.98 21
accuracy 0.98 46
macro avg 0.98 0.98 0.98 46
weighted avg 0.98 0.98 0.98 46
0.9782608695652174
ds_mcf7_simple_ensemble = SimpleMajorityVoteClassifier([
ds_mcf7_top_genes_logit[1],
ds_mcf7_top_genes_svm[1],
ds_mcf7_top_genes_random_forest[1],
ds_mcf7_top_genes_mlp[1]
])
ds_mcf7_simple_ensemble.test()
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2164 66
Actual Norm 49 3128
Accuracy: 0.9787312742740891
Classification report:
precision recall f1-score support
Hypo 0.98 0.97 0.97 2230
Norm 0.98 0.98 0.98 3177
accuracy 0.98 5407
macro avg 0.98 0.98 0.98 5407
weighted avg 0.98 0.98 0.98 5407
0.9787312742740891
ds_hcc_simple_ensemble = SimpleMajorityVoteClassifier([
ds_hcc_top_genes_logit[1],
ds_hcc_top_genes_svm[1],
ds_hcc_top_genes_random_forest[1],
ds_hcc_top_genes_mlp[1]
])
ds_hcc_simple_ensemble.test()
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 2126 99
Actual Norm 77 1369
Accuracy: 0.952056660310542
Classification report:
precision recall f1-score support
Hypo 0.97 0.96 0.96 2225
Norm 0.93 0.95 0.94 1446
accuracy 0.95 3671
macro avg 0.95 0.95 0.95 3671
weighted avg 0.95 0.95 0.95 3671
0.952056660310542
Weighted majority vote¶
A slightly improved version of the simple majority vote is a weighted majority vote, where each model's prediction is weighted by that model's test accuracy. In theory, this should improve accuracy by ensuring that higher-quality votes count for more.
In practice, however, testing this model would require a three-way split of the original data: a training set for the individual models, a test set to obtain their accuracies (the weights), and a second, held-out test set to evaluate the ensemble itself. Given the computational cost of retraining all of the models on a new split, this model is not tested here and stands instead as a potential improvement.
class WeightedMajorityVoteClassifier():
def __init__(self, models: list[TrainedModelWrapper]):
"""
models: list of pretrained classifiers with .predict() method
"""
self.models = models
test_sets = [(model.X_test, model.y_test) for model in self.models]
self.assert_test_sets_equal(test_sets)
self.X_test = models[0].X_test
self.y_test = models[0].y_test
self.features = self.X_test.columns
self.accuracy = None
def predict(self, X):
if missing := set(self.features) - set(X.columns):
raise ValueError(f"Missing columns: {missing}")
predictions = [model.predict(X) for model in self.models]
predictions = np.vstack(predictions)
weights = np.array([model.accuracy for model in self.models])
predictions_T = predictions.T
weighted_votes = []
for sample_preds in predictions_T:
unique_labels = np.unique(sample_preds)
label_weights = {label: 0.0 for label in unique_labels}
for pred, weight in zip(sample_preds, weights):
label_weights[pred] += weight
best_label = max(label_weights, key = label_weights.get)
weighted_votes.append(best_label)
return np.array(weighted_votes)
def assert_test_sets_equal(self, test_sets: list):
ref_X, ref_y = test_sets[0]
for i, (X, y) in enumerate(test_sets[1:], start=1):
if not (np.array_equal(ref_X, X) and np.array_equal(ref_y, y)):
raise ValueError(f"Test set {i} does not match the reference test set.")
def test(self, X_test: Any | None = None, y_test: Any | None = None, verbose: bool = True):
if X_test is None or y_test is None:  # compare with None: the truth value of a DataFrame is ambiguous
X_test = self.X_test
y_test = self.y_test
self.accuracy = test_model(self, X_test, y_test, verbose)
return self.accuracy
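Even untested against held-out data, the weighting rule itself is easy to illustrate: two weaker models outvote one stronger model only when their combined weight is higher. A toy example for a single sample, with hypothetical test accuracies as weights:

```python
sample_preds = ["Hypo", "Norm", "Norm"]  # one sample, three models
weights = [0.95, 0.60, 0.60]             # hypothetical per-model test accuracies

# Accumulate each label's total weight.
label_weights: dict[str, float] = {}
for pred, weight in zip(sample_preds, weights):
    label_weights[pred] = label_weights.get(pred, 0.0) + weight

winner = max(label_weights, key = label_weights.get)
print(winner)  # 'Norm', since 0.60 + 0.60 > 0.95
```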
Generalized majority vote¶
The ensemble model can be generalized to be agnostic to the data set: it compares the features present in the input against each model's selected features and uses only those models whose feature sets are fully covered by the input. Generalizing over subsets of genes with high predictive power across different data sets produces a very robust model.
Again, due to the computational cost and data constraints of dividing the data set into more partitions, this model does not weight the predictions. Given sufficient computational power and data, an improved model could potentially be built by ensembling WeightedMajorityVoteClassifiers instead.
class GeneralizedMajorityVoteClassifier():
def __init__(self, models: list[SimpleMajorityVoteClassifier]):
"""
models: list of pretrained classifiers with .predict() method
"""
self.models = models
self.test_sets = [(model.X_test, model.y_test) for model in self.models]
self.features = [X.columns for X, _ in self.test_sets]
def predict(self, X):
input_features = set(X.columns)
predictions = [model.predict(X.loc[:, model.features]) for model in self.models if not set(model.features) - input_features]
if not predictions:  # guard: np.vstack fails on an empty list
    raise ValueError("No model's feature set is contained in the input columns.")
predictions = np.vstack(predictions)
predictions_T = predictions.T
majority_votes = []
for sample_preds in predictions_T:
values, counts = np.unique(sample_preds, return_counts = True)
best_label = values[np.argmax(counts)]
majority_votes.append(best_label)
return np.array(majority_votes)
def test(self, X_test, y_test, verbose: bool = True):
return test_model(self, X_test, y_test, verbose)
generalized_classifier = GeneralizedMajorityVoteClassifier([ss_mcf7_simple_ensemble, ss_hcc_simple_ensemble, ds_mcf7_simple_ensemble, ds_hcc_simple_ensemble])
generalized_classifier_scores = {}
generalized_classifier_scores["ss_mcf7"] = generalized_classifier.test(X_ss_mcf7, y_ss_mcf7)
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 124 0
Actual Norm 0 126
Accuracy: 1.0
Classification report:
precision recall f1-score support
Hypo 1.00 1.00 1.00 124
Norm 1.00 1.00 1.00 126
accuracy 1.00 250
macro avg 1.00 1.00 1.00 250
weighted avg 1.00 1.00 1.00 250
generalized_classifier_scores["ss_hcc"] = generalized_classifier.test(X_ss_hcc, y_ss_hcc)
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 96 1
Actual Norm 0 85
Accuracy: 0.9945054945054945
Classification report:
precision recall f1-score support
Hypo 1.00 0.99 0.99 97
Norm 0.99 1.00 0.99 85
accuracy 0.99 182
macro avg 0.99 0.99 0.99 182
weighted avg 0.99 0.99 0.99 182
generalized_classifier_scores["ds_mcf7"] = generalized_classifier.test(X_ds_mcf7, y_ds_mcf7)
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 8806 115
Actual Norm 107 12598
Accuracy: 0.9897345787478036
Classification report:
precision recall f1-score support
Hypo 0.99 0.99 0.99 8921
Norm 0.99 0.99 0.99 12705
accuracy 0.99 21626
macro avg 0.99 0.99 0.99 21626
weighted avg 0.99 0.99 0.99 21626
generalized_classifier_scores["ds_hcc"] = generalized_classifier.test(X_ds_hcc, y_ds_hcc)
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 8721 178
Actual Norm 189 5594
Accuracy: 0.9750034055305816
Classification report:
precision recall f1-score support
Hypo 0.98 0.98 0.98 8899
Norm 0.97 0.97 0.97 5783
accuracy 0.98 14682
macro avg 0.97 0.97 0.97 14682
weighted avg 0.97 0.98 0.97 14682
Final comparisons¶
def plot_accuracies(
data_dict: dict[str, dict[str, float]],
title: str = "Model Accuracies",
ylabel: str = "Accuracy (%)",
xlabel = "Models"
):
colors = plt.cm.tab10.colors
category_color_map = {}
entries = []
# Flatten and store (model_name, accuracy, category)
for i, (category, models) in enumerate(data_dict.items()):
category_color_map[category] = colors[i % len(colors)]
for model_name, acc in models.items():
acc = acc * 100 if 0 <= acc <= 1 else acc
entries.append((model_name, acc, category))
# Sort by accuracy (descending)
entries.sort(key = lambda x: x[1], reverse = True)
# Extract values for plotting
model_labels = [f"{model}" for model, _, _ in entries]
accuracies = [accuracy for _, accuracy, _ in entries]
categories = [category for _, _, category in entries]
bar_colors = [category_color_map[category] for category in categories]
x_positions = np.arange(len(entries))
plt.figure(figsize = (max(10, len(entries) * 0.6), 8))
bars = plt.bar(x_positions, accuracies, color = bar_colors, edgecolor = "black")
# Set Y-limits before placing text
min_y = min(accuracies)
max_y = max(accuracies)
if max_y - min_y < 5:
padding = max(1.0, (max_y - min_y) * 0.2)
else:
padding = (max_y - min_y) * 0.2
plt.ylim(min_y - padding, max_y + padding)
# Add accuracy labels above bars
for x, acc in zip(x_positions, accuracies):
label_y = acc + (padding * 0.2)
        plt.text(x, label_y, f"{acc:.1f}%", ha = "center", va = "bottom", fontsize = 9)
# X-axis
plt.xticks(x_positions, model_labels, rotation = 45, ha = "right")
plt.ylabel(ylabel)
plt.xlabel(xlabel)
plt.title(title)
plt.grid(axis = "y", linestyle = "--", alpha = 0.7)
# Legend
handles = [plt.Rectangle((0, 0), 1, 1, color = category_color_map[category]) for category in category_color_map]
labels = list(category_color_map.keys())
plt.legend(handles, labels, title = "Category")
plt.tight_layout()
plt.show()
Plots¶
plot_accuracies(
{
"PCA": {
"Logistic Regression": ss_mcf7_pca_logit.accuracy,
"SVM": ss_mcf7_pca_svm.accuracy,
"Random Forest": ss_mcf7_pca_random_forest.accuracy,
"Multilayer Perceptron": ss_mcf7_pca_mlp.accuracy
},
"Top PCs": {
"Logistic Regression": ss_mcf7_top_pcs_logit[1].accuracy,
"SVM": ss_mcf7_top_pcs_svm[1].accuracy,
"Random Forest": ss_mcf7_top_pcs_random_forest[1].accuracy,
"Multilayer Perceptron": ss_mcf7_top_pcs_mlp[1].accuracy
},
"Top Genes": {
"Logistic Regression": ss_mcf7_top_genes_logit[1].accuracy,
"SVM": ss_mcf7_top_genes_svm[1].accuracy,
"Random Forest": ss_mcf7_top_genes_random_forest[1].accuracy,
"Multilayer Perceptron": ss_mcf7_top_genes_mlp[1].accuracy
},
"Ensemble": {
"Simple": ss_mcf7_simple_ensemble.accuracy,
"Generalized": generalized_classifier_scores["ss_mcf7"]
}
},
"Model Accuracies: SmartSeq MCF7"
)
plot_accuracies(
{
"PCA": {
"Logistic Regression": ss_hcc_pca_logit.accuracy,
"SVM": ss_hcc_pca_svm.accuracy,
"Random Forest": ss_hcc_pca_random_forest.accuracy,
"Multilayer Perceptron": ss_hcc_pca_mlp.accuracy
},
"Top PCs": {
"Logistic Regression": ss_hcc_top_pcs_logit[1].accuracy,
"SVM": ss_hcc_top_pcs_svm[1].accuracy,
"Random Forest": ss_hcc_top_pcs_random_forest[1].accuracy,
"Multilayer Perceptron": ss_hcc_top_pcs_mlp[1].accuracy
},
"Top Genes": {
"Logistic Regression": ss_hcc_top_genes_logit[1].accuracy,
"SVM": ss_hcc_top_genes_svm[1].accuracy,
"Random Forest": ss_hcc_top_genes_random_forest[1].accuracy,
"Multilayer Perceptron": ss_hcc_top_genes_mlp[1].accuracy
},
"Ensemble": {
"Simple": ss_hcc_simple_ensemble.accuracy,
"Generalized": generalized_classifier_scores["ss_hcc"]
}
},
"Model Accuracies: SmartSeq HCC"
)
plot_accuracies(
{
"PCA": {
"Logistic Regression": ds_mcf7_pca_logit.accuracy,
"SVM": ds_mcf7_pca_svm.accuracy,
"Random Forest": ds_mcf7_pca_random_forest.accuracy,
"Multilayer Perceptron": ds_mcf7_pca_mlp.accuracy
},
"Top PCs": {
"Logistic Regression": ds_mcf7_top_pcs_logit[1].accuracy,
"SVM": ds_mcf7_top_pcs_svm[1].accuracy,
"Random Forest": ds_mcf7_top_pcs_random_forest[1].accuracy,
"Multilayer Perceptron": ds_mcf7_top_pcs_mlp[1].accuracy
},
"Top Genes": {
"Logistic Regression": ds_mcf7_top_genes_logit[1].accuracy,
"SVM": ds_mcf7_top_genes_svm[1].accuracy,
"Random Forest": ds_mcf7_top_genes_random_forest[1].accuracy,
"Multilayer Perceptron": ds_mcf7_top_genes_mlp[1].accuracy
},
"Ensemble": {
"Simple": ds_mcf7_simple_ensemble.accuracy,
"Generalized": generalized_classifier_scores["ds_mcf7"]
}
},
"Model Accuracies: DropSeq MCF7"
)
plot_accuracies(
{
"PCA": {
"Logistic Regression": ds_hcc_pca_logit.accuracy,
"SVM": ds_hcc_pca_svm.accuracy,
"Random Forest": ds_hcc_pca_random_forest.accuracy,
"Multilayer Perceptron": ds_hcc_pca_mlp.accuracy
},
"Top PCs": {
"Logistic Regression": ds_hcc_top_pcs_logit[1].accuracy,
"SVM": ds_hcc_top_pcs_svm[1].accuracy,
"Random Forest": ds_hcc_top_pcs_random_forest[1].accuracy,
"Multilayer Perceptron": ds_hcc_top_pcs_mlp[1].accuracy
},
"Top Genes": {
"Logistic Regression": ds_hcc_top_genes_logit[1].accuracy,
"SVM": ds_hcc_top_genes_svm[1].accuracy,
"Random Forest": ds_hcc_top_genes_random_forest[1].accuracy,
"Multilayer Perceptron": ds_hcc_top_genes_mlp[1].accuracy
},
"Ensemble": {
"Simple": ds_hcc_simple_ensemble.accuracy,
"Generalized": generalized_classifier_scores["ds_hcc"]
}
},
"Model Accuracies: DropSeq HCC"
)
Final Model¶
Across all four datasets, the generalized ensemble model performs best. It takes the majority vote over the predictions of four simple ensemble models, each of which itself takes the majority vote of individual classifiers trained on the top genes identified through feature selection. This composite ensemble, spanning different model families trained on different datasets, yields a robust and generalizable classifier. The hyperparameters of the individual models comprising each simple ensemble are:
dataset_types = ["SmartSeq MCF7", "SmartSeq HCC", "DropSeq MCF7", "DropSeq HCC"]
for i, dataset_type in enumerate(dataset_types):
    print(f"{dataset_type} =========================================================")
for model in generalized_classifier.models[i].models:
print(model.model)
print()
SmartSeq MCF7 =========================================================
LogisticRegression(C=0.01, max_iter=20000, n_jobs=-1, random_state=10)
LinearSVC(C=0.025, max_iter=10000, random_state=10)
RandomForestClassifier(max_depth=5, n_estimators=25, n_jobs=-1, random_state=10)
SmartSeq HCC =========================================================
LogisticRegression(C=0.01, max_iter=20000, n_jobs=-1, random_state=10)
LinearSVC(C=0.025, max_iter=10000, random_state=10)
RandomForestClassifier(max_depth=5, n_estimators=25, n_jobs=-1, random_state=10)
DropSeq MCF7 =========================================================
LogisticRegression(C=1, max_iter=20000, n_jobs=-1, random_state=10,
solver='sag')
LinearSVC(C=0.025, max_iter=10000, random_state=10)
RandomForestClassifier(class_weight='balanced', max_depth=30, n_estimators=400,
n_jobs=-1, random_state=10)
MLPClassifier(alpha=0.001, early_stopping=True, hidden_layer_sizes=(200,),
max_iter=500, random_state=10)
DropSeq HCC =========================================================
LogisticRegression(C=1, max_iter=20000, n_jobs=-1, random_state=10,
solver='sag')
LinearSVC(C=0.025, max_iter=10000, random_state=10)
RandomForestClassifier(class_weight='balanced', max_depth=20,
min_samples_leaf=2, min_samples_split=10,
n_estimators=300, n_jobs=-1, random_state=10)
MLPClassifier(early_stopping=True, hidden_layer_sizes=(200,), max_iter=500,
random_state=10)
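The nested majority-vote logic described above can be sketched as follows. The `majority_vote` helper and the per-model predictions are hypothetical illustrations, not the project's actual classes:

```python
import numpy as np

def majority_vote(predictions: list[np.ndarray]) -> np.ndarray:
    """Element-wise majority vote over a list of label arrays."""
    stacked = np.stack(predictions)  # shape: (n_models, n_cells)
    # For each cell, keep the label predicted by the most models
    return np.array([max(set(col), key=list(col).count) for col in stacked.T])

# Hypothetical predictions from three base classifiers on five cells
logit_preds = np.array(["Hypo", "Norm", "Hypo", "Hypo", "Norm"])
svm_preds   = np.array(["Hypo", "Hypo", "Hypo", "Norm", "Norm"])
rf_preds    = np.array(["Norm", "Norm", "Hypo", "Hypo", "Norm"])

# A simple ensemble votes over its base models; the generalized ensemble
# then votes the same way over the outputs of the four simple ensembles.
simple_vote = majority_vote([logit_preds, svm_preds, rf_preds])
print(simple_vote)
```

With an even number of voters (four simple ensembles), ties are possible; the actual implementation presumably breaks them in some fixed way.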
Test predictions¶
ss_mcf7_test = pd.read_csv("AILab2025/SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter = " ", engine = "python", index_col = 0)
ss_hcc_test = pd.read_csv("AILab2025/SmartSeq/HCC1806_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter = " ", engine = "python", index_col = 0)
ds_mcf7_test = pd.read_csv("AILab2025/DropSeq/MCF7_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter = " ", engine = "python", index_col = 0)
ds_hcc_test = pd.read_csv("AILab2025/DropSeq/HCC1806_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter = " ", engine = "python", index_col = 0)
ss_mcf7_test_predictions = generalized_classifier.predict(ss_mcf7_test.T)
ss_hcc_test_predictions = generalized_classifier.predict(ss_hcc_test.T)
ds_mcf7_test_predictions = generalized_classifier.predict(ds_mcf7_test.T)
ds_hcc_test_predictions = generalized_classifier.predict(ds_hcc_test.T)
np.savetxt("test_predictions_ss_mcf7.tsv", ss_mcf7_test_predictions, delimiter = "\t", fmt = "%s")
np.savetxt("test_predictions_ss_hcc.tsv", ss_hcc_test_predictions, delimiter = "\t", fmt = "%s")
np.savetxt("test_predictions_ds_mcf7.tsv", ds_mcf7_test_predictions, delimiter = "\t", fmt = "%s")
np.savetxt("test_predictions_ds_hcc.tsv", ds_hcc_test_predictions, delimiter = "\t", fmt = "%s")
Cross-Dataset Evaluation¶
Set-up¶
# Number of common genes
def common_genes(data1, data2, name1="Dataset 1", name2="Dataset 2"):
    common = data1.index.intersection(data2.index)
    print(f"Number of common genes in {name1} and {name2}: {len(common)}")
common_genes(ss_mcf7_norm, ss_hcc_norm, "ss_mcf7_norm", "ss_hcc_norm")
common_genes(ds_mcf7_norm, ds_hcc_norm, "ds_mcf7_norm", "ds_hcc_norm")
common_genes(ss_mcf7_norm, ds_mcf7_norm, "ss_mcf7_norm", "ds_mcf7_norm")
common_genes(ss_hcc_norm, ds_hcc_norm, "ss_hcc_norm", "ds_hcc_norm")
Number of common genes in ss_mcf7_norm and ss_hcc_norm: 1208
Number of common genes in ds_mcf7_norm and ds_hcc_norm: 834
Number of common genes in ss_mcf7_norm and ds_mcf7_norm: 496
Number of common genes in ss_hcc_norm and ds_hcc_norm: 516
def train_cross_dataset_classifier(
df_train: pd.DataFrame,
df_test: pd.DataFrame,
model_func: Callable,
test_func: Callable,
random_state: int = 42,
use_pca: bool = False,
pca_var_threshold: float = 0.95,
    verbose: bool = True
) -> tuple[ClassifierMixin, np.ndarray, np.ndarray, list, list]:
"""
Trains a classifier on one dataset and evaluates it on another after optional PCA alignment.
Args:
df_train (pd.DataFrame): Training gene expression data (genes x cells).
df_test (pd.DataFrame): Testing gene expression data (genes x cells).
model_func (Callable): Function like train_svm(X_train, y_train, ...) returning a trained model.
test_func (Callable): Function to test the models.
random_state (int): Random seed.
use_pca (bool): Whether to apply PCA before training.
pca_var_threshold (float): Variance threshold for PCA if used.
Returns:
model: Trained model.
X_train, X_test: Input data (PCA-transformed or raw).
y_train, y_test: Labels.
"""
# Align common genes
common_genes = df_train.index.intersection(df_test.index)
df_train_aligned = df_train.loc[common_genes]
df_test_aligned = df_test.loc[common_genes]
assert list(df_train_aligned.index) == list(df_test_aligned.index), "Gene alignment failed"
# Transpose to cells × genes
X_train = df_train_aligned.T
X_test = df_test_aligned.T
# Optionally apply PCA
# Can the axes of variation in X_train explain the structure of X_test?
if use_pca:
pca = PCA(n_components=pca_var_threshold)
X_train = pca.fit_transform(X_train) # Learn principal components from training data
X_test = pca.transform(X_test) # Apply same components to testing data
# Extract labels
y_train = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in df_train_aligned.columns]
y_test = ['Hypo' if 'hypo' in name.lower() else 'Norm' for name in df_test_aligned.columns]
# Train the model
model = model_func(X_train=X_train, y_train=y_train, random_state=random_state, n_jobs=-1, verbose=verbose)
# Evaluate the model
if test_func is not None:
test_func(model=model, X_test=X_test, y_test=y_test, verbose=verbose)
return model, X_train, X_test, y_train, y_test
def select_best_cross_dataset_model(
df_train: pd.DataFrame,
df_test: pd.DataFrame,
train_funcs: list[Callable],
test_func: Callable,
use_pca: bool = False,
pca_var_threshold: float = 0.95,
random_state: int = 42,
    verbose: bool = False
):
"""
Trains multiple models cross-dataset and selects the best one by test accuracy.
Args:
df_train, df_test: Gene expression dataframes.
        train_funcs: List of training functions to compare.
        test_func: Function to evaluate the best model.
use_pca: Whether to apply PCA.
pca_var_threshold: % variance to retain in PCA.
random_state: Random seed.
Returns:
Prints the best model info and returns (model_name, accuracy).
"""
best_accuracy = -1
best_result = None
best_model_info = {}
for train_func in train_funcs:
model_name = train_func.__name__.replace("train_", "").replace("_", " ").title()
# Wrap train_func to suppress training output
wrapped_train_func = lambda *args, **kwargs: train_func(*args, **{**kwargs, "verbose": False})
model, X_train, X_test, y_train, y_test = train_cross_dataset_classifier(
df_train=df_train,
df_test=df_test,
model_func=wrapped_train_func,
test_func=None, # skip evaluation for now
use_pca=use_pca,
pca_var_threshold=pca_var_threshold,
random_state=random_state,
verbose=False
)
# Compute accuracy manually
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"{model_name} test accuracy: {accuracy:.4f}")
if accuracy > best_accuracy:
best_accuracy = accuracy
best_result = (model_name, accuracy)
best_model_info = {
"train_func": train_func,
"test_func": test_func,
"X_test": X_test,
"y_test": y_test,
"model": model
}
# Re-run training and testing for best model if verbose=True
if verbose:
print(f"\n============ Best Model: {best_result[0]} ============")
best_model_info["train_func"](
X_train=X_train,
y_train=y_train,
random_state=random_state,
n_jobs=-1,
verbose=verbose
)
best_model_info["test_func"](
model=best_model_info["model"],
X_test=best_model_info["X_test"],
y_test=best_model_info["y_test"]
)
return best_result
Independent of Technology¶
# Find the highest accuracy model for MCF7, training on non-reduced Drop-seq data and testing on Smart-seq
select_best_cross_dataset_model(ds_mcf7_norm, ss_mcf7_norm, [train_svm, train_logistic_regression], test_model, use_pca=False)
Svm test accuracy: 0.9680
Logistic Regression test accuracy: 0.9800
('Logistic Regression', 0.98)
# With PCA reduced data
select_best_cross_dataset_model(ds_mcf7_norm, ss_mcf7_norm, [train_svm, train_logistic_regression], test_model, use_pca=True)
Svm test accuracy: 0.9680
Logistic Regression test accuracy: 0.9800
('Logistic Regression', 0.98)
For MCF7, training an SVM and a Logistic Regression model on Drop-seq and testing on Smart-seq, we obtain a test accuracy of 0.968 and 0.980, respectively. This result is plausible because the Drop-seq dataset contains around 20,000 cells, enabling the model to learn a robust and generalizable decision boundary. In contrast, the Smart-seq test set is much smaller (approximately 250 cells). The high accuracy suggests that the transcriptional differences between hypoxic and normoxic states in MCF7 are consistently captured across both technologies, allowing strong cross-platform generalization.
# Find the highest accuracy model for HCC1806, training on non-reduced Drop-seq data and testing on Smart-seq
select_best_cross_dataset_model(ds_hcc_norm, ss_hcc_norm,
[train_svm, train_logistic_regression], test_model, use_pca=False)
Svm test accuracy: 0.7912
Logistic Regression test accuracy: 0.7857
('Svm', 0.7912087912087912)
# With PCA reduced data
select_best_cross_dataset_model(ds_hcc_norm, ss_hcc_norm,
[train_svm, train_logistic_regression], test_model, use_pca=True)
Svm test accuracy: 0.7198
Logistic Regression test accuracy: 0.6978
('Svm', 0.7197802197802198)
# Note precision & recall
model, *_ = train_cross_dataset_classifier(
ds_hcc_norm, ss_hcc_norm,
model_func=train_svm,
test_func=test_model,
)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.8780823960759975
C: 0.025
Penalty: l2
Intercept: [-0.15301313]
Max Iterations: 10000
Number of iterations for convergence: 7
Training accuracy: 0.8914997956681651
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 96 1
Actual Norm 37 48
Accuracy: 0.7912087912087912
Classification report:
precision recall f1-score support
Hypo 0.72 0.99 0.83 97
Norm 0.98 0.56 0.72 85
accuracy 0.79 182
macro avg 0.85 0.78 0.78 182
weighted avg 0.84 0.79 0.78 182
Independent of Cell Line¶
# Find the highest accuracy model for Smart-seq using non-reduced data
select_best_cross_dataset_model(ss_hcc_norm, ss_mcf7_norm,
[train_svm, train_logistic_regression, train_random_forest], test_model, use_pca=False)
Svm test accuracy: 0.9200
Logistic Regression test accuracy: 0.9240
Random Forest test accuracy: 0.8880
('Logistic Regression', 0.924)
# With PCA reduced data
select_best_cross_dataset_model(ss_hcc_norm, ss_mcf7_norm,
[train_svm, train_logistic_regression, train_random_forest], test_model, use_pca=True)
Svm test accuracy: 0.8280
Logistic Regression test accuracy: 0.9560
Random Forest test accuracy: 0.9920
('Random Forest', 0.992)
We obtain a near-perfect score for the Random Forest classifier on PCA-reduced data (0.992), suggesting that the model extracted highly generalizable features from the HCC1806 Smart-seq expression profiles that transfer effectively to MCF7 cells, despite the difference in cell type. Interestingly, on raw (non-PCA) data, Random Forest becomes the worst-performing model (0.888). This contrast implies that dimensionality reduction was crucial in mitigating overfitting and emphasizing informative variance, especially for tree-based models prone to memorizing noisy patterns in high-dimensional input.
On the other hand, SVM performs slightly better without PCA, which is expected in settings where the signal is linearly separable or preserved across many genes. Since PCA transforms the data into a lower-dimensional space by discarding some variance, SVM may lose access to subtle but discriminative features. This highlights a trade-off: while PCA improves generalization for models sensitive to noise (like Random Forest), it can limit the expressive power of models that benefit from the full feature space when regularized properly.
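The variance-threshold form of PCA used here (scikit-learn's `PCA` accepts a float `n_components` as the fraction of variance to retain) can be illustrated on synthetic data; the matrix dimensions and latent structure below are made up for the sketch:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic "expression" matrix: 200 cells x 500 genes, driven by only
# five strong latent axes of variation plus isotropic noise
signal = (rng.normal(size=(200, 5)) @ rng.normal(size=(5, 500))) * 3
X = signal + rng.normal(size=(200, 500))

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)

# Only a handful of components survive, yet >= 95% of variance is retained
print(X_reduced.shape[1], round(pca.explained_variance_ratio_.sum(), 3))
```

The discarded 5% of variance is exactly where subtle but discriminative features can be lost, which is the trade-off discussed above.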
# Find the highest accuracy model for Drop-seq using non-reduced data
select_best_cross_dataset_model(ds_hcc_norm, ds_mcf7_norm,
[train_svm, train_logistic_regression], test_model, use_pca=False)
Svm test accuracy: 0.6995
Logistic Regression test accuracy: 0.7066
('Logistic Regression', 0.7065569222232498)
# With PCA reduced data
select_best_cross_dataset_model(ds_hcc_norm, ds_mcf7_norm,
[train_svm, train_logistic_regression, train_random_forest], test_model, use_pca=True)
Svm test accuracy: 0.7030
Logistic Regression test accuracy: 0.7040
Random Forest test accuracy: 0.6643
('Logistic Regression', 0.7039674465920651)
# Note precision & recall
model, *_ = train_cross_dataset_classifier(
ds_hcc_norm, ds_mcf7_norm,
model_func=train_svm,
test_func=test_model,
)
========================= Training =========================
Best Parameters: {'C': 0.025}
Best Score (CV avg): 0.9119331344241793
C: 0.025
Penalty: l2
Intercept: [-0.28561926]
Max Iterations: 10000
Number of iterations for convergence: 7
Training accuracy: 0.9298460700177088
========================= Testing =========================
Confusion matrix:
Predicted Hypo Predicted Norm
Actual Hypo 7063 1858
Actual Norm 4640 8065
Accuracy: 0.6995283455100342
Classification report:
precision recall f1-score support
Hypo 0.60 0.79 0.68 8921
Norm 0.81 0.63 0.71 12705
accuracy 0.70 21626
macro avg 0.71 0.71 0.70 21626
weighted avg 0.73 0.70 0.70 21626
Summary¶
When training and testing on the same dataset, the models maintain balanced precision and recall, indicating consistent performance in identifying both hypoxic and normoxic cells. However, in cross-dataset evaluations, recall exceeds precision for the hypoxic class, meaning the models effectively identify most hypoxic cells, but at the cost of misclassifying some normoxic cells as hypoxic. Conversely, for normoxic cells, precision is higher than recall, indicating that the models are more conservative in labeling normoxic cells, missing some but rarely misclassifying hypoxic cells as normoxic.
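This asymmetry can be read directly off the confusion matrices. As a worked check against the Drop-seq-to-Smart-seq HCC1806 matrix reported above (taking Hypo as the positive class):

```python
# Counts from the cross-dataset confusion matrix (Hypo = positive class)
tp, fn = 96, 1   # actual Hypo predicted as Hypo / Norm
fp, tn = 37, 48  # actual Norm predicted as Hypo / Norm

precision_hypo = tp / (tp + fp)  # 96 / 133: many false Hypo alarms
recall_hypo = tp / (tp + fn)     # 96 / 97: almost no Hypo cells missed
precision_norm = tn / (tn + fn)  # 48 / 49: Norm labels are rarely wrong
recall_norm = tn / (tn + fp)     # 48 / 85: many Norm cells mislabeled

print(f"Hypo precision={precision_hypo:.2f}, recall={recall_hypo:.2f}")
print(f"Norm precision={precision_norm:.2f}, recall={recall_norm:.2f}")
```

These recover the 0.72/0.99 (Hypo) and 0.98/0.56 (Norm) figures from the classification report, making the conservative-on-Norm, liberal-on-Hypo behaviour explicit.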